fjmgAI committed on
Commit 9c2036a · verified · 1 Parent(s): a625032

Update README.md

Files changed (1)
  1. README.md +95 -436
README.md CHANGED
@@ -28,40 +28,36 @@ model-index:
28
  - type: accuracy
29
  value: 0.9848384857177734
30
  name: Accuracy
31
  ---
 
32
 
33
- # PyLate model based on EuroBERT/EuroBERT-210m
34
 
35
- This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [EuroBERT/EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m) on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
36
 
37
- ## Model Details
 
38
 
39
- ### Model Description
40
- - **Model Type:** PyLate model
41
- - **Base model:** [EuroBERT/EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m) <!-- at revision 5a0c63d3e255a4f2005d3591d5508b7fd07cae94 -->
42
- - **Document Length:** 180 tokens
43
- - **Query Length:** 32 tokens
44
- - **Output Dimensionality:** 128 dimensions
45
- - **Similarity Function:** MaxSim
46
- - **Training Dataset:**
47
- - [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)
48
- <!-- - **Language:** Unknown -->
49
- <!-- - **License:** Unknown -->
50
 
51
- ### Model Sources
 
52
 
53
- - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
54
- - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
55
- - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
56
 
57
- ### Full Model Architecture
 
 
58
 
59
- ```
60
- ColBERT(
61
- (0): Transformer({'max_seq_length': 31, 'do_lower_case': False}) with Transformer model: EuroBertModel
62
- (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
63
- )
64
- ```
65
 
66
  ## Usage
67
  First install the PyLate library:
@@ -70,379 +66,79 @@ First install the PyLate library:
70
  pip install -U pylate
71
  ```
72
 
73
- ### Retrieval
74
-
75
- PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
76
-
77
- #### Indexing documents
78
-
79
- First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
80
-
81
- ```python
82
- from pylate import indexes, models, retrieve
83
-
84
- # Step 1: Load the ColBERT model
85
- model = models.ColBERT(
86
- model_name_or_path=pylate_model_id,
87
- )
88
-
89
- # Step 2: Initialize the Voyager index
90
- index = indexes.Voyager(
91
- index_folder="pylate-index",
92
- index_name="index",
93
- override=True, # This overwrites the existing index if any
94
- )
95
-
96
- # Step 3: Encode the documents
97
- documents_ids = ["1", "2", "3"]
98
- documents = ["document 1 text", "document 2 text", "document 3 text"]
99
-
100
- documents_embeddings = model.encode(
101
- documents,
102
- batch_size=32,
103
- is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
104
- show_progress_bar=True,
105
- )
106
-
107
- # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
108
- index.add_documents(
109
- documents_ids=documents_ids,
110
- documents_embeddings=documents_embeddings,
111
- )
112
- ```
113
-
114
- Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
115
-
116
- ```python
117
- # To load an index, simply instantiate it with the correct folder/name and without overriding it
118
- index = indexes.Voyager(
119
- index_folder="pylate-index",
120
- index_name="index",
121
- )
122
- ```
123
-
124
- #### Retrieving top-k documents for queries
125
-
126
- Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
127
- To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the top matches ids and relevance scores:
128
-
129
- ```python
130
- # Step 1: Initialize the ColBERT retriever
131
- retriever = retrieve.ColBERT(index=index)
132
-
133
- # Step 2: Encode the queries
134
- queries_embeddings = model.encode(
135
- ["query for document 3", "query for document 1"],
136
- batch_size=32,
137
- is_query=True, # Ensure that it is set to True to indicate that these are queries
138
- show_progress_bar=True,
139
- )
140
-
141
- # Step 3: Retrieve top-k documents
142
- scores = retriever.retrieve(
143
- queries_embeddings=queries_embeddings,
144
- k=10, # Retrieve the top 10 matches for each query
145
- )
146
- ```
147
-
148
- ### Reranking
149
- If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:
150
 
151
  ```python
152
- from pylate import rank, models
153
-
154
- queries = [
155
- "query A",
156
- "query B",
157
- ]
158
-
159
- documents = [
160
- ["document A", "document B"],
161
- ["document 1", "document C", "document B"],
162
- ]
163
-
164
- documents_ids = [
165
- [1, 2],
166
- [1, 3, 2],
167
- ]
168
-
169
- model = models.ColBERT(
170
- model_name_or_path=pylate_model_id,
171
- )
172
-
173
- queries_embeddings = model.encode(
174
- queries,
175
- is_query=True,
176
- )
177
-
178
- documents_embeddings = model.encode(
179
- documents,
180
- is_query=False,
181
- )
182
-
183
- reranked_documents = rank.rerank(
184
- documents_ids=documents_ids,
185
- queries_embeddings=queries_embeddings,
186
- documents_embeddings=documents_embeddings,
187
- )
188
  ```
189
 
190
- <!--
191
- ### Direct Usage (Transformers)
192
-
193
- <details><summary>Click to see the direct usage in Transformers</summary>
194
-
195
- </details>
196
- -->
197
-
198
- <!--
199
- ### Downstream Usage (Sentence Transformers)
200
-
201
- You can finetune this model on your own dataset.
202
-
203
- <details><summary>Click to expand</summary>
204
-
205
- </details>
206
- -->
207
-
208
- <!--
209
- ### Out-of-Scope Use
210
-
211
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
212
- -->
213
-
214
- ## Evaluation
215
-
216
- ### Metrics
217
-
218
- #### ColBERTTriplet
219
-
220
- * Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>
221
-
222
- | Metric | Value |
223
- |:-------------|:-----------|
224
- | **accuracy** | **0.9848** |
225
-
226
- <!--
227
- ## Bias, Risks and Limitations
228
-
229
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
230
- -->
231
-
232
- <!--
233
- ### Recommendations
234
-
235
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
236
- -->
237
-
238
- ## Training Details
239
-
240
- ### Training Dataset
241
-
242
- #### rag-comprehensive-triplets
243
-
244
- * Dataset: [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) at [678e83e](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets/tree/678e83ed6a74d17c38b33344168abc7787e39754)
245
- * Size: 909,188 training samples
246
- * Columns: <code>query</code>, <code>positive</code>, <code>negative</code>, <code>original_id</code>, <code>dataset_source</code>, <code>category</code>, and <code>language</code>
247
- * Approximate statistics based on the first 1000 samples:
248
- | | query | positive | negative | original_id | dataset_source | category | language |
249
- |:--------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:-------------------------------------------------------------------------------|
250
- | type | string | string | string | string | string | string | string |
251
- | details | <ul><li>min: 8 tokens</li><li>mean: 23.7 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 28.42 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 29.19 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 3.93 tokens</li><li>max: 4 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 16.0 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 4.62 tokens</li><li>max: 5 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 3.0 tokens</li><li>max: 3 tokens</li></ul> |
252
- * Samples:
253
- | query | positive | negative | original_id | dataset_source | category | language |
254
- |:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------|:---------------------------------------------------------------|:------------------------------|:----------------|
255
- | <code>Escriba una historia sobre un viaje de pesca en hielo en Minnesota que incluya todos los detalles importantes, desde la ropa hasta el equipo de pesca, y que también mencione la importancia de la seguridad y los buenos modales</code> | <code>Ah, ¡invierno! Es hora de ponerse la ropa interior larga. Ponte unos calcetines de lana y un jersey. Ponte los pantalones de nieve. Ponte el gorro de media. Coge la caña de pescar y el cubo de cebo.<br>Hay hielo en el lago, y es la estación de disfrutar de una auténtica aventura en Minnesota: la pesca en hielo. No te preocupes por pasar frío o aburrirte en un lago helado. Pescar en el hielo es fácil y emocionante. Es divertido caminar por el hielo imaginando soles hambrientos o morsas acechando debajo. Es una aventura pasar el rato alrededor de un agujero de hielo con los amigos y la familia, contando historias y sujetando una caña de pescar de aspecto gracioso mientras esperáis un bocado. Y es emocionante cuando tu bobber se desvanece de repente en el agujero y sacas un pez escurridizo del agua con un chapoteo. Así que coge a un adulto, un termo de cacao caliente y prepárate para una aventura de pesca en el hielo.<br><br>Empieza con una visita a tu tienda de cebos local o a la oficin...</code> | <code>Una aventura de pesca en hielo en Minnesota puede ser una experiencia emocionante y divertida, siempre y cuando se esté preparado con la ropa y el equipo adecuados para el verano</code> | <code>10954</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>creative_writing</code> | <code>es</code> |
256
- | <code>¿Cuáles son los materiales necesarios para realizar la impresión con bloques?</code> | <code>La impresión con bloques es una forma de arte en la que el artista talla un bloque (normalmente de vinilo o goma) y utiliza tinta para imprimir la imagen. Los materiales necesarios para ello son el bloque para tallar, una herramienta de tallado, un rodillo para aplicar la tinta, tinta, papel o material para la imagen. También se necesita una superficie lisa y plana para extender la tinta; una pequeña lámina de cristal o plexiglás funciona bien para ello.</code> | <code>La impresión con bloques es una forma de arte en la que el artista talla un bloque (normalmente de madera o cartón) y utiliza tinta para imprimir la imagen. Los materiales necesarios para ello son el bloque para tallar, una herramienta de tallado, un rodillo para aplicar la tinta, tinta, papel o material para la imagen. También se necesita una superficie rugosa y curva para extender la tinta; una pequeña lámina de plástico o tela funciona bien para ello.</code> | <code>13815</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>brainstorming</code> | <code>es</code> |
257
- | <code>¿Cuál es el propósito de la Primera Enmienda de la Constitución de Estados Unidos?</code> | <code>La Primera Enmienda garantiza la libertad de expresión y de culto en Estados Unidos.</code> | <code>La Primera Enmienda garantiza la libertad de asociación y de expresión en Estados Unidos.</code> | <code>4168</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>open_qa</code> | <code>es</code> |
258
- * Loss: <code>pylate.losses.contrastive.Contrastive</code>
259
-
260
- ### Evaluation Dataset
261
-
262
- #### rag-comprehensive-triplets
263
-
264
- * Dataset: [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) at [678e83e](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets/tree/678e83ed6a74d17c38b33344168abc7787e39754)
265
- * Size: 909,188 evaluation samples
266
- * Columns: <code>query</code>, <code>positive</code>, <code>negative</code>, <code>original_id</code>, <code>dataset_source</code>, <code>category</code>, and <code>language</code>
267
- * Approximate statistics based on the first 1000 samples:
268
- | | query | positive | negative | original_id | dataset_source | category | language |
269
- |:--------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:-------------------------------------------------------------------------------|
270
- | type | string | string | string | string | string | string | string |
271
- | details | <ul><li>min: 6 tokens</li><li>mean: 23.24 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 28.7 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 29.32 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 3.94 tokens</li><li>max: 4 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 16.0 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 4.65 tokens</li><li>max: 5 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 3.0 tokens</li><li>max: 3 tokens</li></ul> |
272
- * Samples:
273
- | query | positive | negative | original_id | dataset_source | category | language |
274
- |:-------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------|:---------------------------------------------------------------|:------------------------------------|:----------------|
275
- | <code>¿Alguien puede decirme sobre el Firefly Music Festival que dura 4 días?</code> | <code>El Festival de Música Firefly es un evento multigénero que se celebra en Dover, Delaware, y que comenzó en 2012</code> | <code>El Festival de Música Firefly es un evento de música en vivo que se celebra en Dover, Delaware, y que comenzó en 2015, con una duración de 5 días y géneros como rock, pop y hip hop.</code> | <code>3446</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>open_qa</code> | <code>es</code> |
276
- | <code>¿Cuáles son los nombres de los siete países alpinos de oeste a este?</code> | <code>Los Alpes (/ælps/) son la cadena montañosa más alta y extensa de Europa, que se extiende aproximadamente 1.200 km a través de siete países alpinos (de oeste a este): Francia, Suiza, Italia, Liechtenstein, Austria, Alemania y Eslovenia.<br>El arco alpino se extiende desde Niza, en el Mediterráneo occidental, hasta Trieste, en el Adriático, y Viena, en el inicio de la cuenca panónica. Las montañas se formaron a lo largo de decenas de millones de años al chocar las placas tectónicas africana y euroasiática. El acortamiento extremo provocado por este acontecimiento hizo que las rocas sedimentarias marinas se elevaran por empuje y plegamiento hasta formar altos picos montañosos como el Mont Blanc y el Cervino.<br>El Mont Blanc se extiende por la frontera franco-italiana y, con 4.809 m, es la montaña más alta de los Alpes. La zona de los Alpes contiene 128 picos de más de 4.000 m de altura.</code> | <code>La región de los Alpes se extiende a lo largo de ocho países, desde Francia en el oeste hasta Hungría en el este.</code> | <code>13897</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>information_extraction</code> | <code>es</code> |
277
- | <code>quiero saber si estos números son pares o no: 13, 200, 334, 420, 5, 12, ¿me puedes decir?</code> | <code>13: Impar<br>200: Pares<br>334: Par<br>420: Par<br>5: Impar<br>12: Par</code> | <code>13 es par, 200 es impar, 334 es par, 420 es par, 5 es par, 12 es impar</code> | <code>12562</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>classification</code> | <code>es</code> |
278
- * Loss: <code>pylate.losses.contrastive.Contrastive</code>
279
-
280
- ### Training Hyperparameters
281
- #### Non-Default Hyperparameters
282
-
283
- - `eval_strategy`: steps
284
- - `per_device_train_batch_size`: 32
285
- - `per_device_eval_batch_size`: 32
286
- - `learning_rate`: 2e-05
287
- - `num_train_epochs`: 1
288
- - `fp16`: True
289
- - `load_best_model_at_end`: True
290
-
291
- #### All Hyperparameters
292
- <details><summary>Click to expand</summary>
293
-
294
- - `overwrite_output_dir`: False
295
- - `do_predict`: False
296
- - `eval_strategy`: steps
297
- - `prediction_loss_only`: True
298
- - `per_device_train_batch_size`: 32
299
- - `per_device_eval_batch_size`: 32
300
- - `per_gpu_train_batch_size`: None
301
- - `per_gpu_eval_batch_size`: None
302
- - `gradient_accumulation_steps`: 1
303
- - `eval_accumulation_steps`: None
304
- - `torch_empty_cache_steps`: None
305
- - `learning_rate`: 2e-05
306
- - `weight_decay`: 0.0
307
- - `adam_beta1`: 0.9
308
- - `adam_beta2`: 0.999
309
- - `adam_epsilon`: 1e-08
310
- - `max_grad_norm`: 1.0
311
- - `num_train_epochs`: 1
312
- - `max_steps`: -1
313
- - `lr_scheduler_type`: linear
314
- - `lr_scheduler_kwargs`: {}
315
- - `warmup_ratio`: 0.0
316
- - `warmup_steps`: 0
317
- - `log_level`: passive
318
- - `log_level_replica`: warning
319
- - `log_on_each_node`: True
320
- - `logging_nan_inf_filter`: True
321
- - `save_safetensors`: True
322
- - `save_on_each_node`: False
323
- - `save_only_model`: False
324
- - `restore_callback_states_from_checkpoint`: False
325
- - `no_cuda`: False
326
- - `use_cpu`: False
327
- - `use_mps_device`: False
328
- - `seed`: 42
329
- - `data_seed`: None
330
- - `jit_mode_eval`: False
331
- - `use_ipex`: False
332
- - `bf16`: False
333
- - `fp16`: True
334
- - `fp16_opt_level`: O1
335
- - `half_precision_backend`: auto
336
- - `bf16_full_eval`: False
337
- - `fp16_full_eval`: False
338
- - `tf32`: None
339
- - `local_rank`: 0
340
- - `ddp_backend`: None
341
- - `tpu_num_cores`: None
342
- - `tpu_metrics_debug`: False
343
- - `debug`: []
344
- - `dataloader_drop_last`: False
345
- - `dataloader_num_workers`: 0
346
- - `dataloader_prefetch_factor`: None
347
- - `past_index`: -1
348
- - `disable_tqdm`: False
349
- - `remove_unused_columns`: True
350
- - `label_names`: None
351
- - `load_best_model_at_end`: True
352
- - `ignore_data_skip`: False
353
- - `fsdp`: []
354
- - `fsdp_min_num_params`: 0
355
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
356
- - `fsdp_transformer_layer_cls_to_wrap`: None
357
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
358
- - `deepspeed`: None
359
- - `label_smoothing_factor`: 0.0
360
- - `optim`: adamw_torch
361
- - `optim_args`: None
362
- - `adafactor`: False
363
- - `group_by_length`: False
364
- - `length_column_name`: length
365
- - `ddp_find_unused_parameters`: None
366
- - `ddp_bucket_cap_mb`: None
367
- - `ddp_broadcast_buffers`: False
368
- - `dataloader_pin_memory`: True
369
- - `dataloader_persistent_workers`: False
370
- - `skip_memory_metrics`: True
371
- - `use_legacy_prediction_loop`: False
372
- - `push_to_hub`: False
373
- - `resume_from_checkpoint`: None
374
- - `hub_model_id`: None
375
- - `hub_strategy`: every_save
376
- - `hub_private_repo`: None
377
- - `hub_always_push`: False
378
- - `gradient_checkpointing`: False
379
- - `gradient_checkpointing_kwargs`: None
380
- - `include_inputs_for_metrics`: False
381
- - `include_for_metrics`: []
382
- - `eval_do_concat_batches`: True
383
- - `fp16_backend`: auto
384
- - `push_to_hub_model_id`: None
385
- - `push_to_hub_organization`: None
386
- - `mp_parameters`:
387
- - `auto_find_batch_size`: False
388
- - `full_determinism`: False
389
- - `torchdynamo`: None
390
- - `ray_scope`: last
391
- - `ddp_timeout`: 1800
392
- - `torch_compile`: False
393
- - `torch_compile_backend`: None
394
- - `torch_compile_mode`: None
395
- - `dispatch_batches`: None
396
- - `split_batches`: None
397
- - `include_tokens_per_second`: False
398
- - `include_num_input_tokens_seen`: False
399
- - `neftune_noise_alpha`: None
400
- - `optim_target_modules`: None
401
- - `batch_eval_metrics`: False
402
- - `eval_on_start`: False
403
- - `use_liger_kernel`: False
404
- - `eval_use_gather_object`: False
405
- - `average_tokens_across_devices`: False
406
- - `prompts`: None
407
- - `batch_sampler`: batch_sampler
408
- - `multi_dataset_batch_sampler`: proportional
409
-
410
- </details>
411
-
412
- ### Training Logs
413
- | Epoch | Step | Training Loss | Validation Loss | accuracy |
414
- |:----------:|:--------:|:-------------:|:---------------:|:--------:|
415
- | 0.1065 | 500 | 1.6396 | - | - |
416
- | 0 | 0 | - | - | 0.8016 |
417
- | 0.1065 | 500 | - | 0.8725 | - |
418
- | 0.2131 | 1000 | 0.699 | - | - |
419
- | 0 | 0 | - | - | 0.8968 |
420
- | 0.2131 | 1000 | - | 0.5092 | - |
421
- | 0.3196 | 1500 | 0.4315 | - | - |
422
- | 0 | 0 | - | - | 0.9242 |
423
- | 0.3196 | 1500 | - | 0.3369 | - |
424
- | 0.4262 | 2000 | 0.2833 | - | - |
425
- | 0 | 0 | - | - | 0.9522 |
426
- | 0.4262 | 2000 | - | 0.2331 | - |
427
- | 0.5327 | 2500 | 0.1848 | - | - |
428
- | 0 | 0 | - | - | 0.9661 |
429
- | 0.5327 | 2500 | - | 0.1655 | - |
430
- | 0.6392 | 3000 | 0.1317 | - | - |
431
- | 0 | 0 | - | - | 0.9776 |
432
- | 0.6392 | 3000 | - | 0.1162 | - |
433
- | 0.7458 | 3500 | 0.0975 | - | - |
434
- | 0 | 0 | - | - | 0.9815 |
435
- | 0.7458 | 3500 | - | 0.0947 | - |
436
- | 0.8523 | 4000 | 0.0716 | - | - |
437
- | 0 | 0 | - | - | 0.9815 |
438
- | 0.8523 | 4000 | - | 0.0806 | - |
439
- | **0.9589** | **4500** | **0.059** | **-** | **-** |
440
- | 0 | 0 | - | - | 0.9848 |
441
- | **0.9589** | **4500** | **-** | **0.0673** | **-** |
442
-
443
- * The bold row denotes the saved checkpoint.
444
-
445
- ### Framework Versions
446
  - Python: 3.10.12
447
  - Sentence Transformers: 3.4.1
448
  - PyLate: 1.1.7
@@ -452,48 +148,11 @@ You can finetune this model on your own dataset.
452
  - Datasets: 3.3.1
453
  - Tokenizers: 0.21.0
454
455
 
456
- ## Citation
457
-
458
- ### BibTeX
459
-
460
- #### Sentence Transformers
461
- ```bibtex
462
- @inproceedings{reimers-2019-sentence-bert,
463
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
464
- author = "Reimers, Nils and Gurevych, Iryna",
465
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
466
- month = "11",
467
- year = "2019",
468
- publisher = "Association for Computational Linguistics",
469
- url = "https://arxiv.org/abs/1908.10084"
470
- }
471
- ```
472
-
473
- #### PyLate
474
- ```bibtex
475
- @misc{PyLate,
476
- title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
477
- author={Chaffin, Antoine and Sourty, Raphaël},
478
- url={https://github.com/lightonai/pylate},
479
- year={2024}
480
- }
481
- ```
482
-
483
- <!--
484
- ## Glossary
485
-
486
- *Clearly define terms in order to be accessible across audiences.*
487
- -->
488
-
489
- <!--
490
- ## Model Card Authors
491
-
492
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
493
- -->
494
 
495
- <!--
496
- ## Model Card Contact
497
 
498
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
499
- -->
 
28
  - type: accuracy
29
  value: 0.9848384857177734
30
  name: Accuracy
31
+ license: apache-2.0
32
+ language:
33
+ - es
34
+ - en
35
  ---
36
+ [<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI)
37
 
38
+ ## Fine-Tuned Model
39
 
40
+ **`fjmgAI/col1-210M-EuroBERT`**
41
 
42
+ ## Base Model
43
+ **`EuroBERT/EuroBERT-210m`**
44
 
45
+ ## Fine-Tuning Method
46
+ Fine-tuning was performed with **[PyLate](https://github.com/lightonai/pylate)** using contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The resulting model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity with the MaxSim operator.
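
As a quick illustration of those per-token vectors, the minimal sketch below encodes a query and a document with the PyLate `encode` API (the example sentences are placeholders):

```python
from pylate import models

# Load the fine-tuned model; trust_remote_code is needed for the EuroBERT architecture
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Each text becomes a sequence of 128-dimensional token vectors
query_embeddings = model.encode(["¿Qué es la pesca en hielo?"], is_query=True)
document_embeddings = model.encode(
    ["La pesca en hielo es una tradición invernal en Minnesota."],
    is_query=False,
)

print(query_embeddings[0].shape)     # (num_query_tokens, 128)
print(document_embeddings[0].shape)  # (num_document_tokens, 128)
```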
47
 
48
+ ## Dataset
49
+ **[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)**
50
 
51
+ ### Description
52
+ The dataset was filtered to its Spanish-language subset, yielding **303,000 examples** of (query, positive, negative) triplets for comprehensive RAG training.
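
The sketch below shows one way to reproduce that filter with the `datasets` library; the split name and the exact row count are assumptions based on the dataset's columns and samples shown elsewhere in this card:

```python
from datasets import load_dataset

# Load the full multilingual triplet dataset (assumed split name: "train")
dataset = load_dataset("baconnier/rag-comprehensive-triplets", split="train")

# Keep only the Spanish rows used for this fine-tune
spanish = dataset.filter(lambda row: row["language"] == "es")

print(len(spanish))                  # expected to be roughly 303,000 examples
print(spanish[0]["query"])           # anchor question
print(spanish[0]["positive"][:100])  # relevant passage
print(spanish[0]["negative"][:100])  # hard negative
```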
 
53
 
54
+ ## Fine-Tuning Details
55
+ - The model was trained with **contrastive training**, using the <code>pylate.losses.contrastive.Contrastive</code> loss.
56
+ - Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>; a condensed training and evaluation sketch follows the table below.
57
 
58
+ | Metric | Value |
59
+ |:-------------|:-----------|
60
+ | **accuracy** | **0.9848** |
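
A condensed sketch of how such a run can be set up with PyLate is shown below. The learning rate, batch size, single epoch, fp16 flag and step-based evaluation mirror the non-default hyperparameters reported for this model; the dataset split, output path, collator and evaluator arguments are assumptions, not the exact training script.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import evaluation, losses, models, utils

# Start from the EuroBERT base model; PyLate adds the 128-dimensional projection head
model = models.ColBERT(model_name_or_path="EuroBERT/EuroBERT-210m", trust_remote_code=True)

# Spanish (query, positive, negative) triplets; see the Dataset section above
dataset = load_dataset("baconnier/rag-comprehensive-triplets", split="train")
dataset = dataset.filter(lambda row: row["language"] == "es")
dataset = dataset.select_columns(["query", "positive", "negative"])
splits = dataset.train_test_split(test_size=0.01, seed=42)

# Triplet-accuracy evaluator (assumed arguments, mirroring sentence-transformers' TripletEvaluator)
dev_evaluator = evaluation.ColBERTTripletEvaluator(
    anchors=splits["test"]["query"],
    positives=splits["test"]["positive"],
    negatives=splits["test"]["negative"],
)

args = SentenceTransformerTrainingArguments(
    output_dir="col1-210M-EuroBERT",  # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    fp16=True,
    eval_strategy="steps",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    loss=losses.Contrastive(model=model),
    evaluator=dev_evaluator,
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```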
61
 
62
  ## Usage
63
  First install the PyLate library:
 
66
  pip install -U pylate
67
  ```
68
 
69
+ ### Calculate Similarity
70
 
71
  ```python
72
+ import torch
73
+ from pylate import models
74
+
75
+ # Load the ColBERT model from Hugging Face Hub
76
+ # 'trust_remote_code=True' is required because the EuroBERT base architecture uses custom modeling code
77
+ model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)
78
+
79
+ # Move the model to GPU if available, otherwise use CPU
80
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
81
+ model.to(device)
82
+
83
+ # Example data for similarity comparison
84
+ query = "¿Cuál es la capital de España?" # Query sentence
85
+ positive_doc = "La capital de España es Madrid." # Relevant document
86
+ negative_doc = "Florida es un estado en los Estados Unidos." # Irrelevant document
87
+ sentences = [query, positive_doc, negative_doc] # Combine all texts
88
+
89
+ # Tokenize the input sentences using ColBERT's tokenizer
90
+ # This converts text to token IDs and attention masks
91
+ inputs = model.tokenize(sentences)
92
+
93
+ # Move all input tensors to the same device as the model (GPU/CPU)
94
+ inputs = {key: value.to(device) for key, value in inputs.items()}
95
+
96
+ # Generate token embeddings (no gradients needed for inference)
97
+ with torch.no_grad():
98
+ # Forward pass through the model
99
+ embeddings_dict = model(inputs) # Returns dictionary with model outputs
100
+
101
+ # Extract token-level embeddings (shape: [batch_size, seq_length, embedding_dim])
102
+ embeddings = embeddings_dict['token_embeddings']
103
+ print(embeddings.shape) # Expected: [3, 32, 128] (3 texts, 32 tokens max, 128-dim embeddings)
104
+
105
+ # Define ColBERT's MaxSim similarity function
106
+ def colbert_similarity(query_emb, doc_emb):
107
+ """
108
+ Computes ColBERT-style similarity between query and document embeddings.
109
+ Uses maximum similarity (MaxSim) between individual tokens.
110
+
111
+ Args:
112
+ query_emb: [query_tokens, embedding_dim]
113
+ doc_emb: [doc_tokens, embedding_dim]
114
+
115
+ Returns:
116
+ Normalized similarity score
117
+ """
118
+ # Compute dot product between all token pairs
119
+ similarity_matrix = torch.matmul(query_emb, doc_emb.T) # [query_tokens, doc_tokens]
120
+
121
+ # Get maximum similarity for each query token (MaxSim)
122
+ max_similarities = similarity_matrix.max(dim=1)[0]
123
+
124
+ # Return average of maximum similarities (normalized by query length)
125
+ return max_similarities.sum() / query_emb.shape[0]
126
+
127
+ # Extract embeddings for each text
128
+ query_emb = embeddings[0] # [32, 128] - Query embeddings
129
+ positive_emb = embeddings[1] # [32, 128] - Positive document embeddings
130
+ negative_emb = embeddings[2] # [32, 128] - Negative document embeddings
131
+
132
+ # Compute similarity scores
133
+ positive_score = colbert_similarity(query_emb, positive_emb) # Query vs positive doc
134
+ negative_score = colbert_similarity(query_emb, negative_emb) # Query vs negative doc
135
+
136
+ # Print results (move scores to CPU first if using GPU)
137
+ print(f"Similarity with positive document: {positive_score.item():.4f}")
138
+ print(f"Similarity with negative document: {negative_score.item():.4f}")
139
  ```
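
The same MaxSim-style relevance comparison can also be done without manual tokenization, using PyLate's higher-level `encode` and `rank.rerank` helpers:

```python
from pylate import models, rank

model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

queries = ["¿Cuál es la capital de España?"]
documents = [["La capital de España es Madrid.", "Florida es un estado en los Estados Unidos."]]
documents_ids = [[1, 2]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# For each query, returns the document ids ordered by MaxSim relevance score
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked_documents)
```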
140
 
141
+ ## Framework Versions
142
  - Python: 3.10.12
143
  - Sentence Transformers: 3.4.1
144
  - PyLate: 1.1.7
 
148
  - Datasets: 3.3.1
149
  - Tokenizers: 0.21.0
150
 
151
+ ## Purpose
152
+ This fine-tuned model is designed for **Spanish-language applications** that require **efficient semantic search**. It compares embeddings at the token level with the MaxSim operation, making it well suited for **question answering and document retrieval**.
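
For retrieval over a larger Spanish corpus, the usual PyLate pattern is to index document embeddings once with the Voyager HNSW index and then query it. A minimal sketch (folder names and example texts are placeholders):

```python
from pylate import indexes, models, retrieve

model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Build (or overwrite) a local Voyager HNSW index
index = indexes.Voyager(index_folder="pylate-index", index_name="es-docs", override=True)

documents_ids = ["1", "2"]
documents = [
    "Madrid es la capital de España y su ciudad más poblada.",
    "El Amazonas es el río más caudaloso del mundo.",
]
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=model.encode(documents, is_query=False),
)

# Retrieve the top-k documents for a Spanish question
retriever = retrieve.ColBERT(index=index)
scores = retriever.retrieve(
    queries_embeddings=model.encode(["¿Cuál es la capital de España?"], is_query=True),
    k=2,
)
print(scores)  # per query, a list of (document id, relevance score) matches
```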
153
154
 
155
+ - **Developed by:** fjmgAI
156
+ - **License:** apache-2.0
157
 
158
+ [<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate)