rasyosef committed
Commit c2f9933 · verified · 1 Parent(s): 77870be

Add new SparseEncoder model

Files changed (4)
  1. README.md +126 -174
  2. config.json +1 -1
  3. config_sentence_transformers.json +1 -1
  4. model.safetensors +1 -1
README.md CHANGED
@@ -1,60 +1,38 @@
1
  ---
2
- language:
3
- - en
4
- license: mit
5
  tags:
6
  - sentence-transformers
7
  - sparse-encoder
8
  - sparse
9
  - splade
10
  - generated_from_trainer
11
- - dataset_size:496123
12
  - loss:SpladeLoss
13
  - loss:SparseMarginMSELoss
14
  - loss:FlopsLoss
15
- base_model: prajjwal1/bert-tiny
16
  widget:
17
- - text: >-
18
- Hurley doesn't just want to be your go-to for surf gear, but the be the
19
- brand that represents your lifestyle. Of course you have your pick up board
20
- shorts, tanks and a Hurley hat while you're on the beach, but you can also
21
- look at graphic tees, sandals, and accessories when you're on the street.
22
- - text: >-
23
- Electric field of a positive and a negative point charge. Electric charge is
24
- the physical property of matter that causes it to experience a force when
25
- placed in an electromagnetic field.There are two types of electric charges:
26
- positive and negative.lectric charge is a characteristic property of many
27
- subatomic particles. The charges of free-standing particles are integer
28
- multiples of the elementary charge e; we say that electric charge is
29
- quantized. Michael Faraday, in his electrolysis experiments, was the first
30
- to note the discrete nature of electric charge.
31
- - text: >-
32
- The term mechanical digestion refers to the physical breakdown of large
33
- pieces of food into smaller pieces which can subsequently be accessed by
34
- digestive enzymes. In chemical digestion, enzymes break down food into the
35
- small molecules the body can use.
36
- - text: >-
37
- Kids and Quick Solutions. Children learn to put away their clothes when they
38
- can reach the hanging rods. This is actually fun for little ones -- they may
39
- spend a long stretch of time putting hangers on and taking them off the rods
40
- -- as long as the rods are child-height.So take your stand against piles of
41
- clothes on the floor of the teen's bedroom early by re-sizing the closet to
42
- fit the kid.his is actually fun for little ones -- they may spend a long
43
- stretch of time putting hangers on and taking them off the rods -- as long
44
- as the rods are child-height. So take your stand against piles of clothes on
45
- the floor of the teen's bedroom early by re-sizing the closet to fit the
46
- kid.
47
- - text: >-
48
- About EUS (endoscopic ultrasound). An EUS, or endoscopic ultrasound, is an
49
- outpatient procedure used to closely examine the tissues in the digestive
50
- tract. The procedure is done using a standard endoscope and a tiny
51
- ultrasound device.The ultrasound sensor sends back visual images of the
52
- digestive tract to a screen, allowing the physician to see deeper into the
53
- tissues and the organs beneath the surface of the intestines.. In general,
54
- an EUS is a very safe procedure. If your procedure is being done on the
55
- upper GI tract, you may have a sore throat for a few days. As a result of
56
- the sedation, you should not drive, operate heavy machinery or make any
57
- important decisions for up to six hours following the procedure.
58
  pipeline_tag: feature-extraction
59
  library_name: sentence-transformers
60
  metrics:
@@ -78,7 +56,7 @@ metrics:
78
  - corpus_active_dims
79
  - corpus_sparsity_ratio
80
  model-index:
81
- - name: SPLADE-BERT-Tiny-Distil
82
  results:
83
  - task:
84
  type: sparse-information-retrieval
@@ -88,86 +66,94 @@ model-index:
88
  type: unknown
89
  metrics:
90
  - type: dot_accuracy@1
91
- value: 0.4602
92
  name: Dot Accuracy@1
93
  - type: dot_accuracy@3
94
- value: 0.7768
95
  name: Dot Accuracy@3
96
  - type: dot_accuracy@5
97
- value: 0.885
98
  name: Dot Accuracy@5
99
  - type: dot_accuracy@10
100
- value: 0.9548
101
  name: Dot Accuracy@10
102
  - type: dot_precision@1
103
- value: 0.4602
104
  name: Dot Precision@1
105
  - type: dot_precision@3
106
- value: 0.2653333333333333
107
  name: Dot Precision@3
108
  - type: dot_precision@5
109
- value: 0.18391999999999997
110
  name: Dot Precision@5
111
  - type: dot_precision@10
112
- value: 0.10024
113
  name: Dot Precision@10
114
  - type: dot_recall@1
115
- value: 0.4461833333333334
116
  name: Dot Recall@1
117
  - type: dot_recall@3
118
- value: 0.7631166666666666
119
  name: Dot Recall@3
120
  - type: dot_recall@5
121
- value: 0.8761
122
  name: Dot Recall@5
123
  - type: dot_recall@10
124
- value: 0.9500333333333334
125
  name: Dot Recall@10
126
  - type: dot_ndcg@10
127
- value: 0.7094495794736737
128
  name: Dot Ndcg@10
129
  - type: dot_mrr@10
130
- value: 0.6344716666666689
131
  name: Dot Mrr@10
132
  - type: dot_map@100
133
- value: 0.6306882016403095
134
  name: Dot Map@100
135
  - type: query_active_dims
136
- value: 16.77560043334961
137
  name: Query Active Dims
138
  - type: query_sparsity_ratio
139
- value: 0.9994503767632085
140
  name: Query Sparsity Ratio
141
  - type: corpus_active_dims
142
- value: 102.47956598021874
143
  name: Corpus Active Dims
144
  - type: corpus_sparsity_ratio
145
- value: 0.9966424360795421
146
  name: Corpus Sparsity Ratio
147
- datasets:
148
- - microsoft/ms_marco
149
  ---
150
 
151
- # SPLADE-BERT-Tiny-Distil
152
 
153
- This is a SPLADE sparse retrieval model based on BERT-Tiny (4M) that was trained by distilling a Cross-Encoder on the MSMARCO dataset. The cross-encoder used was [ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2).
 
154
 
155
- This Tiny SPLADE model beats `BM25` by `65.6%` on the MSMARCO benchmark. Although it is `15x` smaller than Naver's official `splade-v3-distilbert`, it retains `80%` of its performance on MSMARCO. The model is small enough to be used without a GPU on a dataset of a few thousand documents.
156
 
157
- - `Collection:` https://huggingface.co/collections/rasyosef/splade-tiny-msmarco-687c548c0691d95babf65b70
158
- - `Distillation Dataset:` https://huggingface.co/datasets/yosefw/msmarco-train-distil-v2
159
- - `Code:` https://github.com/rasyosef/splade-tiny-msmarco
160
 
161
- ## Performance
 
 
 
162
 
163
- The SPLADE models were evaluated on 55 thousand queries and 8 million documents from the [MSMARCO](https://huggingface.co/datasets/microsoft/ms_marco) dataset.
164
 
165
- |Model|Size (# Params)|MRR@10 (MS MARCO dev)|
- |:---|:----|:-------------------|
- |`BM25`|-|18.6|
- |`rasyosef/splade-tiny`|4.4M|30.8|
- |`rasyosef/splade-mini`|11.2M|32.8|
- |`naver/splade-v3-distilbert`|67.0M|38.7|
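
The relative figures quoted in the introduction can be re-derived from this table. The snippet below is a small arithmetic check, assuming the MRR@10 values and parameter counts are exactly as listed; the dictionary keys and variable names are illustrative only.

```python
# MRR@10 (MS MARCO dev) and parameter counts in millions, copied from the table.
mrr = {"bm25": 18.6, "splade-tiny": 30.8, "splade-v3-distilbert": 38.7}
params_m = {"splade-tiny": 4.4, "splade-v3-distilbert": 67.0}

# Relative improvement of the tiny model over BM25, in percent.
gain_over_bm25 = (mrr["splade-tiny"] - mrr["bm25"]) / mrr["bm25"] * 100

# Share of the larger model's MRR@10 that the tiny model reaches, in percent.
share_of_v3 = mrr["splade-tiny"] / mrr["splade-v3-distilbert"] * 100

# How many times smaller the tiny model is.
size_ratio = params_m["splade-v3-distilbert"] / params_m["splade-tiny"]

print(f"{gain_over_bm25:.1f}% over BM25")
print(f"{share_of_v3:.1f}% of splade-v3-distilbert")
print(f"{size_ratio:.1f}x smaller")
```

The results round to the `65.6%`, `80%` and `15x` figures stated in the text.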
171
 
172
  ## Usage
173
 
@@ -184,15 +170,15 @@ Then you can load this model and run inference.
184
  from sentence_transformers import SparseEncoder
185
 
186
  # Download from the 🤗 Hub
187
- model = SparseEncoder("rasyosef/splade-tiny")
188
  # Run inference
189
  queries = [
190
- "what is eus appointment",
191
  ]
192
  documents = [
193
- "Endoscopic Ultrasound (EUS). You've been referred to have an endoscopic ultrasound, or EUS, which will help your doctor, evaluate or treat your condition. This brochure will give you a basic understanding of the procedure-how it is performed, how it can help, and what side effects you might experience.our doctor can use EUS to diagnose the cause of conditions such as abdominal pain or abnormal weight loss. Or, if your doctor has ruled out certain conditions, EUS can confirm your diagnosis and give you a clean bill of health.",
194
- 'About EUS (endoscopic ultrasound). An EUS, or endoscopic ultrasound, is an outpatient procedure used to closely examine the tissues in the digestive tract. The procedure is done using a standard endoscope and a tiny ultrasound device.The ultrasound sensor sends back visual images of the digestive tract to a screen, allowing the physician to see deeper into the tissues and the organs beneath the surface of the intestines.. In general, an EUS is a very safe procedure. If your procedure is being done on the upper GI tract, you may have a sore throat for a few days. As a result of the sedation, you should not drive, operate heavy machinery or make any important decisions for up to six hours following the procedure.',
195
- 'Endoscopic Ultrasound (EUS) allows your doctor to examine the lining and the walls of your upper and lower gastrointestinal tract.The upper tract is the esophagus, stomach, and duodenum; the lower tract includes your colon and rectum.Doctors also use EUS to study internal organs that lie next to the gastrointestinal tract, such as the gall bladder and the pancreas. Your endoscopist will use a thin, flexible tube called an endoscope.he upper tract is the esophagus, stomach, and duodenum; the lower tract includes your colon and rectum. Doctors also use EUS to study internal organs that lie next to the gastrointestinal tract, such as the gall bladder and the pancreas.',
196
  ]
197
  query_embeddings = model.encode_query(queries)
198
  document_embeddings = model.encode_document(documents)
@@ -202,7 +188,7 @@ print(query_embeddings.shape, document_embeddings.shape)
202
  # Get the similarity scores for the embeddings
203
  similarities = model.similarity(query_embeddings, document_embeddings)
204
  print(similarities)
205
- # tensor([[12.9370, 14.3277, 12.9725]])
206
  ```
207
 
208
  <!--
@@ -229,37 +215,6 @@ You can finetune this model on your own dataset.
229
  *List how the model may foreseeably be misused and address what users ought not to do with the model.*
230
  -->
231
 
232
- ## Model Details
233
-
234
- ### Model Description
235
- - **Model Type:** SPLADE Sparse Encoder
236
- - **Base model:** [prajjwal1/bert-tiny](https://huggingface.co/prajjwal1/bert-tiny) <!-- at revision 6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837 -->
237
- - **Maximum Sequence Length:** 512 tokens
238
- - **Output Dimensionality:** 30522 dimensions
239
- - **Similarity Function:** Dot Product
240
- <!-- - **Training Dataset:** Unknown -->
241
- - **Language:** en
242
- - **License:** mit
243
-
244
- ### Model Sources
245
-
246
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
247
- - **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
248
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
249
- - **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)
250
-
251
- ### Full Model Architecture
252
-
253
- ```
254
- SparseEncoder(
255
- (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
256
- (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
257
- )
258
- ```
259
-
260
- ## More
261
- <details><summary>Click to expand</summary>
262
-
263
  ## Evaluation
264
 
265
  ### Metrics
@@ -270,25 +225,25 @@ SparseEncoder(
270
 
271
  | Metric | Value |
  |:----------------------|:-----------|
- | dot_accuracy@1 | 0.4602 |
- | dot_accuracy@3 | 0.7768 |
- | dot_accuracy@5 | 0.885 |
- | dot_accuracy@10 | 0.9548 |
- | dot_precision@1 | 0.4602 |
- | dot_precision@3 | 0.2653 |
- | dot_precision@5 | 0.1839 |
- | dot_precision@10 | 0.1002 |
- | dot_recall@1 | 0.4462 |
- | dot_recall@3 | 0.7631 |
- | dot_recall@5 | 0.8761 |
- | dot_recall@10 | 0.95 |
- | **dot_ndcg@10** | **0.7094** |
- | dot_mrr@10 | 0.6345 |
- | dot_map@100 | 0.6307 |
- | query_active_dims | 16.7756 |
- | query_sparsity_ratio | 0.9995 |
- | corpus_active_dims | 102.4796 |
- | corpus_sparsity_ratio | 0.9966 |
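
The two sparsity ratios appear to be derived from the active-dimension counts and the 30522-dimensional output space. A minimal sketch, assuming the ratio is defined as one minus the average fraction of non-zero dimensions (the function name is hypothetical):

```python
VOCAB_SIZE = 30522  # output dimensionality listed in the model description


def sparsity_ratio(active_dims: float, dim: int = VOCAB_SIZE) -> float:
    """Average fraction of dimensions that are zero (assumed definition)."""
    return 1.0 - active_dims / dim


# Active-dimension averages copied from the table above.
query_ratio = sparsity_ratio(16.7756)
corpus_ratio = sparsity_ratio(102.4796)

print(round(query_ratio, 4), round(corpus_ratio, 4))
```

Under that assumption the computed values agree with the `query_sparsity_ratio` and `corpus_sparsity_ratio` rows to four decimal places.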
292
 
293
  <!--
294
  ## Bias, Risks and Limitations
@@ -308,25 +263,25 @@ SparseEncoder(
308
 
309
  #### Unnamed Dataset
310
 
311
- * Size: 496,123 training samples
312
- * Columns: <code>query</code>, <code>positive</code>, <code>negative_1</code>, <code>negative_2</code>, <code>negative_3</code>, <code>negative_4</code>, and <code>label</code>
313
  * Approximate statistics based on the first 1000 samples:
314
- | | query | positive | negative_1 | negative_2 | negative_3 | negative_4 | label |
315
- |:--------|:---------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------|
316
- | type | string | string | string | string | string | string | list |
317
- | details | <ul><li>min: 4 tokens</li><li>mean: 9.09 tokens</li><li>max: 37 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 80.68 tokens</li><li>max: 215 tokens</li></ul> | <ul><li>min: 20 tokens</li><li>mean: 78.57 tokens</li><li>max: 238 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 77.8 tokens</li><li>max: 253 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 76.46 tokens</li><li>max: 248 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 75.9 tokens</li><li>max: 190 tokens</li></ul> | <ul><li>size: 4 elements</li></ul> |
318
  * Samples:
319
- | query | positive | negative_1 | negative_2 | negative_3 | negative_4 | label |
320
- |:------|:---------|:-----------|:-----------|:-----------|:-----------|:------|
321
- | <code>could Nexium antacid cause sweating</code> | <code>Summary: Sweating-excessive is found among people who take Nexium, especially for people who are 60+ old, have been taking the drug for.Personalized health information: on eHealthMe you can find out what patients like me (same gender, age) reported their drugs and conditions on FDA and social media since 1977. I am a 56 year old female who has been taking Nexium for 13 years and has been plagued by shingles.. 2 Support group for people who have Sweating-Excessive. 3 Been on warfarin for 6 days and having sweating at times.</code> | <code>More questions for: Nexium, Sweating-excessive. You may be interested at these reviews (Write a review): 1 Xarelto caused shortness of breath. 2 After taking Xarelto for 3 years I suddently experienced shortness of breath, sweating and pain in my arms. 3 Myrbetriq & hyperhidrosis (night sweats). I am a 56 year old female who has been taking Nexium for 13 years and has been plagued by shingles.. 2 Support group for people who have Sweating-Excessive. 3 Been on warfarin for 6 days and having sweating at times.</code> | <code>NEXIUM may help your acid-related symptoms, but you could still have serious stomach problems. Talk with your doctor. NEXIUM can cause serious side effects, including: 1 Diarrhea. 2 NEXIUM may increase your risk of getting severe diarrhea.3 This diarrhea may be caused by an infection (Clostridium difficile) in your intestines.EXIUM can cause serious side effects, including: 1 Diarrhea. 2 NEXIUM may increase your risk of getting severe diarrhea. 3 This diarrhea may be caused by an infection (Clostridium difficile) in your intestines.</code> | <code>Treatment for sweating. The treatment you have will depend on the cause of your sweating. If you have an infection, antibiotics will treat the infection and stop the sweating. If your sweating is due to cancer, treating the cancer can get rid of the sweating.If you have sweating because treatment has changed your hormone levels, it may settle down after a few weeks or months, once your body is used to the treatment. Talk to your doctor or nurse about your sweats.nfection. Infection is one of the most common causes of sweating in people who have cancer. Infection can give you a high temperature and your body sweats to try and reduce it. Treating the infection can control or stop the sweating.</code> | <code>Esomeprazole is used to treat certain stomach and esophagus problems (such as acid reflux, ulcers). It works by decreasing the amount of acid your stomach makes.ide Effects. See also Precautions section. Headache or abdominal pain may occur. If any of these effects persist or worsen, tell your doctor or pharmacist promptly. Remember that your doctor has prescribed this medication because he or she has judged that the benefit to you is greater than the risk of side effects.</code> | <code>[0.5, 6.390576362609863, 11.97206974029541, 16.409034729003906]</code> |
322
- | <code>what is electronic document access</code> | <code>Electronic Document Access (EDA) is a web-based system that provides secure online access, storage, and retrieval of contracts, contract modifications, Government Bills of Lading (GBLs), DFAS Transactions for Others (E110), vouchers, and Contract Deficiency Reports (CDR) to authorized users throughout the Department of Defense (DoD).</code> | <code>An electronic document management system (EDMS) is a software system for organizing and storing different kinds of documents. This type of system is a more particular kind of document management system, a more general type of storage system that helps users to organize and store paper or digital documents.</code> | <code>In many cases, the specific documentation for original storage protocols is a major part of what makes an electronic document management system so valuable to a business or organization.</code> | <code>Benefits derived from DoD EDA include: 1 Single-source, timely information. 2 Electronic search and retrieval 24/7 access/retrieval capability. 3 Increased visibility of all procurement & payment actions. Reduction in data entry/human 1 error. Lower postage, handling, retention and document management costs.</code> | <code>If YES, go to www.docusign.net and log in with your email and password. On the DocuSign Web Application, select the Documents tab. Your documents are listed there. If NO, you can access the document by opening the DocuSign Completed email. This email is sent to you once you have finished signing a DocuSign document. See the instructions below. Note: In some cases, your documents might be attached to the Completed email. 1. Open the DocuSign Completed email.</code> | <code>[4.681269645690918, 9.322907447814941, 14.813400268554688, 20.356698989868164]</code> |
323
- | <code>does hpv cause uti</code> | <code>So now you get in the acidic environment can hpv cause urinary tract infection for the area of the blockage of the fruits and fiber as a completely eliminate urinate at all. Spending money on prescription of antibiotics will kill all of the bacterial infection keeps happening to your veterinarian will work to cure the condition.</code> | <code>HPV & Urinary Tract Infections. Human Papillomavirus (HPV) is a group of viruses that can cause warts and cancers of the cervix, anus and genitals. Urinary tract infection (UTI) occurs when bacteria multiply within the bladder, causing pain and urinary urgency. (Thomas Northcut/Digital Vision/Getty Images) Other People Are Reading.</code> | <code>Some types of the HPV virus can infect the genital epithelial cells (skin and mucous membranes). Some types of HPV virus cause warts that appear on the genitals (vagina, vulva, penis, etc.) and anus of women and men.</code> | <code>Most women with HPV have no signs of infection. Since most HPV infections go away on their own within two years, many women never know they had an infection. Some HPV infections cause genital warts that can be seen or felt. The only way to know if you have HPV is to ask your health care provider to do an HPV test.</code> | <code>Genital warts are caused by low-risk types of human papillomavirus (HPV). These viruses may not cause warts in everyone. Women can get genital warts from sexual contact with someone who has HPV. Genital warts are spread by skin-to-skin contact, usually from contact with the warts. It can be spread by vaginal, anal, oral, or handgenital sexual contact. Genital warts will spread HPV while visible, and after recent treatment.</code> | <code>[0.5, 2.4958395957946777, 3.76273775100708, 4.114340305328369]</code> |
324
  * Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
325
  ```json
326
  {
327
  "loss": "SparseMarginMSELoss",
328
- "document_regularizer_weight": 0.3,
329
- "query_regularizer_weight": 0.5
330
  }
331
  ```
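
To make the role of the two regularizer weights concrete, here is a rough pure-Python sketch of how such a loss could be assembled. It assumes the usual SPLADE recipe (a margin-MSE distillation term plus FLOPS regularizers on document and query vectors) and that each `label` entry is a teacher margin for one negative; the function names, the toy 3-dimensional vectors and the exact weighting are illustrative assumptions, not the library's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def flops(batch):
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))


def margin_mse(query, positive, negatives, teacher_margins):
    """MSE between student margins s(q, pos) - s(q, neg) and teacher margins."""
    student = [dot(query, positive) - dot(query, neg) for neg in negatives]
    return sum((s - t) ** 2 for s, t in zip(student, teacher_margins)) / len(student)


def splade_loss(query, positive, negatives, teacher_margins,
                document_regularizer_weight=0.3, query_regularizer_weight=0.5):
    # Weights default to the values in the JSON block above.
    base = margin_mse(query, positive, negatives, teacher_margins)
    doc_reg = flops([positive] + negatives)
    query_reg = flops([query])
    return (base
            + document_regularizer_weight * doc_reg
            + query_regularizer_weight * query_reg)


# Toy example with made-up 3-dimensional "sparse" vectors.
query = [1.0, 0.0, 2.0]
positive = [2.0, 0.0, 1.0]
negatives = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
teacher_margins = [3.0, 4.0]  # chosen so the distillation term is zero

total = splade_loss(query, positive, negatives, teacher_margins)
print(total)
```

In this toy case the student margins match the teacher exactly, so the total consists only of the two weighted regularization terms, which is what pushes the vectors towards sparsity.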
332
 
@@ -334,15 +289,15 @@ SparseEncoder(
334
  #### Non-Default Hyperparameters
335
 
336
  - `eval_strategy`: epoch
337
- - `per_device_train_batch_size`: 48
338
- - `per_device_eval_batch_size`: 48
339
- - `learning_rate`: 8e-05
340
- - `num_train_epochs`: 6
341
  - `lr_scheduler_type`: cosine
342
- - `warmup_ratio`: 0.025
343
  - `fp16`: True
344
  - `load_best_model_at_end`: True
345
  - `optim`: adamw_torch_fused
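
As an illustration of what the `cosine` scheduler with a `warmup_ratio` implies, the sketch below computes the learning rate at a given step. It assumes linear warmup followed by a half-cosine decay to zero, a warmup length rounded up from `warmup_ratio * total_steps`, and a total of 62016 steps (the last step in the training log); `lr_at` is a hypothetical helper, not a library function.

```python
import math

BASE_LR = 8e-5        # learning_rate
WARMUP_RATIO = 0.025  # warmup_ratio
TOTAL_STEPS = 62_016  # final step reported in the training log

# Assumption: warmup length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at `step` under linear warmup + cosine decay (assumed form)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```

With these values the warmup covers roughly the first 1.5k steps, after which the rate decays smoothly towards zero by the end of the sixth epoch.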
 
346
 
347
  #### All Hyperparameters
348
  <details><summary>Click to expand</summary>
@@ -351,24 +306,24 @@ SparseEncoder(
351
  - `do_predict`: False
352
  - `eval_strategy`: epoch
353
  - `prediction_loss_only`: True
354
- - `per_device_train_batch_size`: 48
355
- - `per_device_eval_batch_size`: 48
356
  - `per_gpu_train_batch_size`: None
357
  - `per_gpu_eval_batch_size`: None
358
  - `gradient_accumulation_steps`: 1
359
  - `eval_accumulation_steps`: None
360
  - `torch_empty_cache_steps`: None
361
- - `learning_rate`: 8e-05
362
  - `weight_decay`: 0.0
363
  - `adam_beta1`: 0.9
364
  - `adam_beta2`: 0.999
365
  - `adam_epsilon`: 1e-08
366
  - `max_grad_norm`: 1.0
367
- - `num_train_epochs`: 6
368
  - `max_steps`: -1
369
  - `lr_scheduler_type`: cosine
370
  - `lr_scheduler_kwargs`: {}
371
- - `warmup_ratio`: 0.025
372
  - `warmup_steps`: 0
373
  - `log_level`: passive
374
  - `log_level_replica`: warning
@@ -425,7 +380,7 @@ SparseEncoder(
425
  - `dataloader_persistent_workers`: False
426
  - `skip_memory_metrics`: True
427
  - `use_legacy_prediction_loop`: False
428
- - `push_to_hub`: False
429
  - `resume_from_checkpoint`: None
430
  - `hub_model_id`: None
431
  - `hub_strategy`: every_save
@@ -468,21 +423,20 @@ SparseEncoder(
468
  </details>
469
 
470
  ### Training Logs
471
- | Epoch | Step | Training Loss | dot_ndcg@10 |
- |:-------:|:---------:|:-------------:|:-----------:|
- | 1.0 | 10336 | 16309.8824 | 0.6698 |
- | 2.0 | 20672 | 14.4047 | 0.6920 |
- | 3.0 | 31008 | 13.0742 | 0.7004 |
- | 4.0 | 41344 | 11.8023 | 0.7060 |
- | 5.0 | 51680 | 11.0464 | 0.7085 |
- | **6.0** | **62016** | **10.6766** | **0.7094** |
479
 
480
  * The bold row denotes the saved checkpoint.
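
The step counts in this log are consistent with the dataset size and batch size reported earlier. A quick check, assuming a single device, `gradient_accumulation_steps` of 1 and a final partial batch that is kept rather than dropped:

```python
import math

TRAIN_SAMPLES = 496_123  # training samples
BATCH_SIZE = 48          # per_device_train_batch_size
EPOCHS = 6               # num_train_epochs

# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch, total_steps)  # 10336 62016
```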
481
 
482
  ### Framework Versions
483
  - Python: 3.11.13
484
  - Sentence Transformers: 5.0.0
485
- - Transformers: 4.53.2
486
  - PyTorch: 2.6.0+cu124
487
  - Accelerate: 1.8.1
488
  - Datasets: 4.0.0
@@ -556,6 +510,4 @@ SparseEncoder(
556
  ## Model Card Contact
557
 
558
  *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
559
- -->
560
-
561
- </details>
 
1
  ---
 
 
 
2
  tags:
3
  - sentence-transformers
4
  - sparse-encoder
5
  - sparse
6
  - splade
7
  - generated_from_trainer
8
+ - dataset_size:1200000
9
  - loss:SpladeLoss
10
  - loss:SparseMarginMSELoss
11
  - loss:FlopsLoss
12
+ base_model: yosefw/SPLADE-BERT-Tiny-BS256
13
  widget:
14
+ - text: 'Most Referenced:report - Return to the USDOJ/OIG Home Page - US Department
15
+ of JusticeReturn to the USDOJ/OIG Home Page - US Department of Justice. Opinion:Roberts:
16
+ Feds to stop using private prisons.'
17
+ - text: 'Paul O''Neill, the founder of the Trans-Siberian Orchestra (pictured) has
18
+ died at age 61. Paul O''Neill, the founder of the popular Christmas-themed rock
19
+ ensemble Trans-Siberian Orchestra has died. A statement on the group''s Facebook
20
+ page reads: The entire Trans-Siberian Orchestra family, past and present, is heartbroken
21
+ to share the devastating news that Paul O’Neill has passed away from chronic illness.'
22
+ - text: meaning for concern
23
+ - text: 'Additional Tips. 1 Do not rub the ink stains as it can spread the stains
24
+ further. 2 Make sure you test the cleaning solution on a small, hidden area to
25
+ check if it is suitable for the material. 3 In case an ink stain has become old
26
+ and dried, the above mentioned home remedies may not be effective.arpet: For ink
27
+ stained spots on a carpet, you may apply a paste of cornstarch and milk. Leave
28
+ it for a few hours before brushing it off. Finally, clean the residue with a vacuum
29
+ cleaner. Leather: Try using a leather shampoo or a leather ink remover for removing
30
+ ink stains from leather items.'
31
+ - text: 'See below: 1. Get your marriage license. Before you can change your name,
32
+ you''ll need the original (or certified) marriage license with the raised seal
33
+ and your new last name on it. Call the clerk''s office where your license was
34
+ filed to get copies if one wasn''t automatically sent to you. 2. Change your Social
35
+ Security card.'
36
  pipeline_tag: feature-extraction
37
  library_name: sentence-transformers
38
  metrics:
 
56
  - corpus_active_dims
57
  - corpus_sparsity_ratio
58
  model-index:
59
+ - name: SPLADE Sparse Encoder
60
  results:
61
  - task:
62
  type: sparse-information-retrieval
 
66
  type: unknown
67
  metrics:
68
  - type: dot_accuracy@1
69
+ value: 0.4772
70
  name: Dot Accuracy@1
71
  - type: dot_accuracy@3
72
+ value: 0.793
73
  name: Dot Accuracy@3
74
  - type: dot_accuracy@5
75
+ value: 0.8964
76
  name: Dot Accuracy@5
77
  - type: dot_accuracy@10
78
+ value: 0.96
79
  name: Dot Accuracy@10
80
  - type: dot_precision@1
81
+ value: 0.4772
82
  name: Dot Precision@1
83
  - type: dot_precision@3
84
+ value: 0.2713333333333333
85
  name: Dot Precision@3
86
  - type: dot_precision@5
87
+ value: 0.18644000000000002
88
  name: Dot Precision@5
89
  - type: dot_precision@10
90
+ value: 0.10094000000000002
91
  name: Dot Precision@10
92
  - type: dot_recall@1
93
+ value: 0.4616666666666666
94
  name: Dot Recall@1
95
  - type: dot_recall@3
96
+ value: 0.7798833333333334
97
  name: Dot Recall@3
98
  - type: dot_recall@5
99
+ value: 0.8874
100
  name: Dot Recall@5
101
  - type: dot_recall@10
102
+ value: 0.95595
103
  name: Dot Recall@10
104
  - type: dot_ndcg@10
105
+ value: 0.721747648718731
106
  name: Dot Ndcg@10
107
  - type: dot_mrr@10
108
+ value: 0.6489996031746051
109
  name: Dot Mrr@10
110
  - type: dot_map@100
111
+ value: 0.6446961471449598
112
  name: Dot Map@100
113
  - type: query_active_dims
114
+ value: 18.334199905395508
115
  name: Query Active Dims
116
  - type: query_sparsity_ratio
117
+ value: 0.9993993119747921
118
  name: Query Sparsity Ratio
119
  - type: corpus_active_dims
120
+ value: 121.65303042911474
121
  name: Corpus Active Dims
122
  - type: corpus_sparsity_ratio
123
+ value: 0.9960142510179831
124
  name: Corpus Sparsity Ratio
 
 
125
  ---
126
 
127
+ # SPLADE Sparse Encoder
128
 
129
+ This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model finetuned from [yosefw/SPLADE-BERT-Tiny-BS256](https://huggingface.co/yosefw/SPLADE-BERT-Tiny-BS256) using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences & paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
130
+ ## Model Details
131
 
132
+ ### Model Description
133
+ - **Model Type:** SPLADE Sparse Encoder
134
+ - **Base model:** [yosefw/SPLADE-BERT-Tiny-BS256](https://huggingface.co/yosefw/SPLADE-BERT-Tiny-BS256) <!-- at revision 239bb34bbfcf6cc8b465eb5b94c76a20c574b47f -->
135
+ - **Maximum Sequence Length:** 512 tokens
136
+ - **Output Dimensionality:** 30522 dimensions
137
+ - **Similarity Function:** Dot Product
138
+ <!-- - **Training Dataset:** Unknown -->
139
+ <!-- - **Language:** Unknown -->
140
+ <!-- - **License:** Unknown -->
141
 
142
+ ### Model Sources
 
 
143
 
144
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
145
+ - **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
146
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
147
+ - **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)
148
 
149
+ ### Full Model Architecture
150
 
151
+ ```
152
+ SparseEncoder(
153
+ (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
154
+ (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
155
+ )
156
+ ```
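
For intuition about the `SpladePooling` step, here is a minimal pure-Python sketch. It assumes the usual SPLADE formulation, in which each token's masked-language-model logit is passed through `log(1 + ReLU(x))` and the result is max-pooled over the sequence; `splade_pool` is a hypothetical helper, and the library itself works on batched tensors over the full 30522-term vocabulary.

```python
import math


def splade_pool(token_logits, attention_mask):
    """Max-pool log(1 + relu(logit)) over non-padding tokens.

    token_logits: list of per-token logit lists, shape [seq_len][vocab]
    attention_mask: list of 0/1 flags, shape [seq_len]
    """
    vocab = len(token_logits[0])
    pooled = [0.0] * vocab
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens do not contribute
        for j, x in enumerate(logits):
            weight = math.log1p(max(x, 0.0))
            if weight > pooled[j]:
                pooled[j] = weight
    return pooled


# Toy example: 3 tokens (the last one is padding) over a 4-term vocabulary.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = [1, 1, 0]

pooled = splade_pool(logits, mask)
print(pooled)
```

Terms whose logits are never positive end up with a weight of exactly zero, which is where the sparsity of the output vector comes from.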
157
 
158
  ## Usage
159
 
 
170
  from sentence_transformers import SparseEncoder
171
 
172
  # Download from the 🤗 Hub
173
+ model = SparseEncoder("yosefw/SPLADE-BERT-Tiny-BS256-distil-v3")
174
  # Run inference
175
  queries = [
176
+ "what do i need to change my name on my license in ma",
177
  ]
178
  documents = [
179
+ 'Change your name on MA state-issued ID such as driver’s license or MA ID card. All documents you bring to RMV need to be originals or certified copies by the issuing agency. PAPERWORK NEEDED: Proof of legal name change A court order showing your legal name change. Your Social Security Card with your new legal name change',
180
+ "See below: 1. Get your marriage license. Before you can change your name, you'll need the original (or certified) marriage license with the raised seal and your new last name on it. Call the clerk's office where your license was filed to get copies if one wasn't automatically sent to you. 2. Change your Social Security card.",
181
+ "You'll keep the same number—just your name will be different. Mail in your application to the local Social Security Administration office. You should get your new card within 10 business days. 3. Change your license at the DMV. Take a trip to the local Department of Motor Vehicles office to get a new license with your new last name. Bring every form of identification you can get your hands on—your old license, your certified marriage certificate and, most importantly, your new Social Security card.",
182
  ]
183
  query_embeddings = model.encode_query(queries)
184
  document_embeddings = model.encode_document(documents)
 
188
  # Get the similarity scores for the embeddings
189
  similarities = model.similarity(query_embeddings, document_embeddings)
190
  print(similarities)
191
+ # tensor([[16.6297, 13.4552, 10.1923]])
192
  ```
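
Because the similarity function is a dot product over sparse vectors, a score only receives contributions from vocabulary entries that are active in both the query and the document. The sketch below shows that idea with plain dictionaries; the helper `sparse_dot` and the token weights are hypothetical and are not produced by this model.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the sparser side.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Illustrative weights only; real vectors come from encode_query / encode_document.
query = {"license": 1.5, "name": 1.0, "change": 0.5}
documents = [
    {"license": 2.0, "name": 1.0, "rmv": 0.5},
    {"marriage": 2.0, "license": 1.0},
]
scores = [sparse_dot(query, doc) for doc in documents]
print(scores)
```

This overlap-only structure is what lets sparse embeddings be served from an inverted index.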
193
 
194
  <!--
 
215
  *List how the model may foreseeably be misused and address what users ought not to do with the model.*
216
  -->
217
 
218
  ## Evaluation
219
 
220
  ### Metrics
 
225
 
226
  | Metric | Value |
227
  |:----------------------|:-----------|
228
+ | dot_accuracy@1 | 0.4772 |
229
+ | dot_accuracy@3 | 0.793 |
230
+ | dot_accuracy@5 | 0.8964 |
231
+ | dot_accuracy@10 | 0.96 |
232
+ | dot_precision@1 | 0.4772 |
233
+ | dot_precision@3 | 0.2713 |
234
+ | dot_precision@5 | 0.1864 |
235
+ | dot_precision@10 | 0.1009 |
236
+ | dot_recall@1 | 0.4617 |
237
+ | dot_recall@3 | 0.7799 |
238
+ | dot_recall@5 | 0.8874 |
239
+ | dot_recall@10 | 0.9559 |
240
+ | **dot_ndcg@10** | **0.7217** |
241
+ | dot_mrr@10 | 0.649 |
242
+ | dot_map@100 | 0.6447 |
243
+ | query_active_dims | 18.3342 |
244
+ | query_sparsity_ratio | 0.9994 |
245
+ | corpus_active_dims | 121.653 |
246
+ | corpus_sparsity_ratio | 0.996 |
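
The sparsity ratios in the table follow from the active-dimension counts and the 30522-dimensional output space: the ratio is one minus the fraction of dimensions that are non-zero on average. A minimal check, using the values reported above (the function name `sparsity_ratio` is only for illustration):

```python
VOCAB_SIZE = 30522  # output dimensionality of the model


def sparsity_ratio(active_dims, dim=VOCAB_SIZE):
    """Fraction of dimensions that are zero, given the mean number of active ones."""
    return 1.0 - active_dims / dim


query_ratio = sparsity_ratio(18.3342)
corpus_ratio = sparsity_ratio(121.65303042911474)
print(round(query_ratio, 4), round(corpus_ratio, 4))
```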
247
 
248
  <!--
249
  ## Bias, Risks and Limitations
 
263
 
264
  #### Unnamed Dataset
265
 
266
+ * Size: 1,200,000 training samples
267
+ * Columns: <code>query</code>, <code>positive</code>, <code>negative_1</code>, <code>negative_2</code>, and <code>label</code>
268
  * Approximate statistics based on the first 1000 samples:
269
+ | | query | positive | negative_1 | negative_2 | label |
270
+ |:--------|:---------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------|
271
+ | type | string | string | string | string | list |
272
+ | details | <ul><li>min: 4 tokens</li><li>mean: 9.08 tokens</li><li>max: 35 tokens</li></ul> | <ul><li>min: 23 tokens</li><li>mean: 79.02 tokens</li><li>max: 192 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 78.24 tokens</li><li>max: 230 tokens</li></ul> | <ul><li>min: 13 tokens</li><li>mean: 75.26 tokens</li><li>max: 230 tokens</li></ul> | <ul><li>size: 2 elements</li></ul> |
273
  * Samples:
274
+ | query | positive | negative_1 | negative_2 | label |
275
+ |:----------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------|
276
+ | <code>does alzheimer's affect sleep</code> | <code>People with Alzheimer’s disease go through many changes, and sleep problems are often some of the most noticeable. Most adults have changes in their sleep patterns as they age. But the problems are more severe and happen more often for people with Alzheimer’s.</code> | <code>Could the position you SLEEP in affect your risk of Alzheimer's? People who sleep on their side enable their brain to 'detox' better while they rest. While asleep, brain is hard at work removing toxins that build up in the day. If left to build up, these toxins can cause Alzheimer's and Parkinson's.</code> | <code>The Scary Connection Between Snoring and Dementia. For more, visit TIME Health. If you don't snore, you likely know someone who does. Between 19% and 40% of adults snore when they sleep, and that percentage climbs even higher, particularly for men, as we age.</code> | <code>[1.407266616821289, 10.169305801391602]</code> |
277
+ | <code>what is fy in steel design</code> | <code>Since the yield strength of the steel is quite clearly defined and controlled, this establishes a very precise reference in structural investigations. An early design decision is that for the yield strength (specified by the Grade of steel used) that is to be used in the design work.Several different grades of steel may be used for large projects, with a minimum grade for ordinary tasks and higher grades for more demanding ones.ost steel used for reinforcement is highly ductile in nature. Its usable strength is its yield strength, as this stress condition initiates such a magnitude of deformation (into the plastic yielding range of the steel), that major cracking will occur in the concrete.</code> | <code>fy is the yield point of the material. E is the symbol for Young's Modulus of the material. E can be measured by dividing the elastic stress by the elastic strain.That is, this measurement must be made before the yield point of the material is reached.y is the yield point of the material. E is the symbol for Young's Modulus of the material. E can be measured by dividing the elastic stress by the elastic strain.</code> | <code>The longest dimension of the cant. WT is 13'. Using ASTM A992 carbon steel, a WT9x35.5 is at full bending stress and deflection limits. (Fy = 50 ksi). The only information I've found about using stainless for structural design is that type 304 is usually used.This yield strength (Fy) is only equal to 39 or 42ksi.sing ASTM A992 carbon steel, a WT9x35.5 is at full bending stress and deflection limits. (Fy = 50 ksi). The only information I've found about using stainless for structural design is that type 304 is usually used.</code> | <code>[0.5, 0.5]</code> |
278
+ | <code>most common nutritional deficiencies for teenagers</code> | <code>: Appendix B: Vitamin and Mineral Deficiencies in the U.S. Some American adults get too little vitamin D, vitamin E, magnesium, calcium, vitamin A and vitamin C (Table B1). More than 40 percent of adults have dietary intakes of vitamin A, C, D and E, calcium and magnesium below the average requirement for their age and gender. Inadequate intake of vitamins and minerals is most common among 14-to-18-year-old teenagers. Adolescent girls have lower nutrient intake than boys (Berner 2014; Fulgoni 2011). But nutrient deficiencies are rare among younger American children; the exceptions are dietary vitamin D and E, for which intake is low for all Americans, and calcium.</code> | <code>Common Nutritional Deficiencies. 10 Most Common Nutritional Deficiencies.. Calcium. Calcium is one of the most abundant minerals in your body, yet most people still manage to have a calcium deficiency. Calcium is best know for adding strength to your bones and teeth.</code> | <code>1) Vitamin D–Vitamin D deficiency is common in infants born to mothers with low levels of Vitamin D. Severe deficiency of this nutrient in infancy and early childhood can lead to the development of Rickets, a disease that affects bone formation and causes bow-legs.</code> | <code>[3.182860851287842, 7.834665775299072]</code> |
279
  * Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
280
  ```json
281
  {
282
  "loss": "SparseMarginMSELoss",
283
+ "document_regularizer_weight": 0.2,
284
+ "query_regularizer_weight": 0.3
285
  }
286
  ```
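
To make the roles of the two regularizer weights concrete, here is a minimal pure-Python sketch of how a SPLADE-style objective is commonly assembled: a margin-MSE distillation term plus FLOPS regularizers on the query and document vectors. It is not the library's implementation. It assumes that each `label` holds teacher margins (positive score minus each negative score) and that the FLOPS term is the sum over dimensions of the squared batch-mean activation; all function names and toy numbers are hypothetical.

```python
def margin_mse(pos_score, neg_scores, teacher_margins):
    """Mean squared error between student and teacher score margins."""
    student_margins = [pos_score - neg for neg in neg_scores]
    squared_errors = [(s - t) ** 2 for s, t in zip(student_margins, teacher_margins)]
    return sum(squared_errors) / len(squared_errors)


def flops(batch_vectors):
    """Sum over dimensions of the squared mean activation across the batch."""
    n = len(batch_vectors)
    dim = len(batch_vectors[0])
    return sum((sum(vec[j] for vec in batch_vectors) / n) ** 2 for j in range(dim))


def splade_objective(base_loss, query_vecs, doc_vecs,
                     query_regularizer_weight=0.3,
                     document_regularizer_weight=0.2):
    """Distillation loss plus weighted sparsity regularizers (assumed form)."""
    return (base_loss
            + query_regularizer_weight * flops(query_vecs)
            + document_regularizer_weight * flops(doc_vecs))


# Toy values: one query with one positive and two negatives.
base = margin_mse(pos_score=12.0, neg_scores=[10.0, 4.0], teacher_margins=[1.5, 8.0])
total = splade_objective(
    base,
    query_vecs=[[1.0, 0.0], [1.0, 2.0]],
    doc_vecs=[[2.0, 0.0], [0.0, 0.0]],
)
print(base, total)
```

Under these assumptions, a larger regularizer weight pushes the corresponding side (queries or documents) toward fewer active dimensions.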
287
 
 
289
  #### Non-Default Hyperparameters
290
 
291
  - `eval_strategy`: epoch
292
+ - `per_device_train_batch_size`: 32
293
+ - `per_device_eval_batch_size`: 32
294
+ - `num_train_epochs`: 5
 
295
  - `lr_scheduler_type`: cosine
296
+ - `warmup_ratio`: 0.05
297
  - `fp16`: True
298
  - `load_best_model_at_end`: True
299
  - `optim`: adamw_torch_fused
300
+ - `push_to_hub`: True
301
 
302
  #### All Hyperparameters
303
  <details><summary>Click to expand</summary>
 
306
  - `do_predict`: False
307
  - `eval_strategy`: epoch
308
  - `prediction_loss_only`: True
309
+ - `per_device_train_batch_size`: 32
310
+ - `per_device_eval_batch_size`: 32
311
  - `per_gpu_train_batch_size`: None
312
  - `per_gpu_eval_batch_size`: None
313
  - `gradient_accumulation_steps`: 1
314
  - `eval_accumulation_steps`: None
315
  - `torch_empty_cache_steps`: None
316
+ - `learning_rate`: 5e-05
317
  - `weight_decay`: 0.0
318
  - `adam_beta1`: 0.9
319
  - `adam_beta2`: 0.999
320
  - `adam_epsilon`: 1e-08
321
  - `max_grad_norm`: 1.0
322
+ - `num_train_epochs`: 5
323
  - `max_steps`: -1
324
  - `lr_scheduler_type`: cosine
325
  - `lr_scheduler_kwargs`: {}
326
+ - `warmup_ratio`: 0.05
327
  - `warmup_steps`: 0
328
  - `log_level`: passive
329
  - `log_level_replica`: warning
 
380
  - `dataloader_persistent_workers`: False
381
  - `skip_memory_metrics`: True
382
  - `use_legacy_prediction_loop`: False
383
+ - `push_to_hub`: True
384
  - `resume_from_checkpoint`: None
385
  - `hub_model_id`: None
386
  - `hub_strategy`: every_save
 
423
  </details>
424
 
425
  ### Training Logs
426
+ | Epoch | Step | Training Loss | dot_ndcg@10 |
427
+ |:-------:|:----------:|:-------------:|:-----------:|
428
+ | 1.0 | 37500 | 11.4095 | 0.7103 |
429
+ | 2.0 | 75000 | 10.5305 | 0.7139 |
430
+ | 3.0 | 112500 | 9.5368 | 0.7197 |
431
+ | **4.0** | **150000** | **8.717** | **0.7216** |
432
+ | 5.0 | 187500 | 8.3094 | 0.7217 |
 
433
 
434
  * The bold row denotes the saved checkpoint.
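
The step counts in the table are consistent with the dataset size and batch size reported above. The short calculation below reproduces them, assuming training on a single device (an assumption, since the device count is not stated) with `gradient_accumulation_steps` of 1; the warmup step count is derived here from `warmup_ratio` and is not reported in the card.

```python
train_samples = 1_200_000
per_device_train_batch_size = 32
num_devices = 1                   # assumption: not stated in the model card
gradient_accumulation_steps = 1
num_train_epochs = 5
warmup_ratio = 0.05

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = train_samples // effective_batch
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = round(total_steps * warmup_ratio)  # derived, approximate

print(steps_per_epoch, total_steps, warmup_steps)
```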
435
 
436
  ### Framework Versions
437
  - Python: 3.11.13
438
  - Sentence Transformers: 5.0.0
439
+ - Transformers: 4.54.0
440
  - PyTorch: 2.6.0+cu124
441
  - Accelerate: 1.8.1
442
  - Datasets: 4.0.0
 
510
  ## Model Card Contact
511
 
512
  *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
513
+ -->
 
 
config.json CHANGED
@@ -17,7 +17,7 @@
17
  "pad_token_id": 0,
18
  "position_embedding_type": "absolute",
19
  "torch_dtype": "float32",
20
- "transformers_version": "4.53.2",
21
  "type_vocab_size": 2,
22
  "use_cache": true,
23
  "vocab_size": 30522
 
17
  "pad_token_id": 0,
18
  "position_embedding_type": "absolute",
19
  "torch_dtype": "float32",
20
+ "transformers_version": "4.54.0",
21
  "type_vocab_size": 2,
22
  "use_cache": true,
23
  "vocab_size": 30522
config_sentence_transformers.json CHANGED
@@ -2,7 +2,7 @@
2
  "model_type": "SparseEncoder",
3
  "__version__": {
4
  "sentence_transformers": "5.0.0",
5
- "transformers": "4.53.2",
6
  "pytorch": "2.6.0+cu124"
7
  },
8
  "prompts": {
 
2
  "model_type": "SparseEncoder",
3
  "__version__": {
4
  "sentence_transformers": "5.0.0",
5
+ "transformers": "4.54.0",
6
  "pytorch": "2.6.0+cu124"
7
  },
8
  "prompts": {
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c5578e5c58d8ff1c071f9ef9a555c2694c08a5b4c196697e4e199218dcc64ff0
3
  size 17671560
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5c31a966edf1180fdc74e1e2056564d353711ec001c395cc9bd00d5368ea38ee
3
  size 17671560