foochun committed
Commit c67c5b5 · verified · 1 Parent(s): 76a7d5b

finetuned with additional names

Files changed (2)
  1. README.md +33 -33
  2. model.safetensors +1 -1
README.md CHANGED
@@ -6,19 +6,20 @@ tags:
  - generated_from_trainer
  - dataset_size:30415
  - loss:BinaryCrossEntropyLoss
+ base_model: cross-encoder/stsb-roberta-base
  pipeline_tag: text-ranking
  library_name: sentence-transformers
  ---

- # CrossEncoder
+ # CrossEncoder based on cross-encoder/stsb-roberta-base

- This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model trained using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
+ This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [cross-encoder/stsb-roberta-base](https://huggingface.co/cross-encoder/stsb-roberta-base) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.

  ## Model Details

  ### Model Description
  - **Model Type:** Cross Encoder
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
+ - **Base model:** [cross-encoder/stsb-roberta-base](https://huggingface.co/cross-encoder/stsb-roberta-base) <!-- at revision d576534b67143e2c70ee9966d7fdbf5835728d13 -->
  - **Maximum Sequence Length:** 512 tokens
  - **Number of Output Labels:** 1 label
  <!-- - **Training Dataset:** Unknown -->
@@ -50,11 +51,11 @@ from sentence_transformers import CrossEncoder
  model = CrossEncoder("foochun/bge-reranker-ft-v2")
  # Get scores for pairs of texts
  pairs = [
- ['ruben s/o veerasamy', 'ruben a/p veerasamy'],
- ['aravind a/l krishnamurthy', 'krishnamurthy s/o aravind'],
- ['nazmi bin ishak', 'siti rohani binti kassim'],
- ['lena yap zhen liang', 'liang zhen yap'],
- ['radhika a/p krishnan', 'radhika a/p arun'],
+ ['chitra nadarajah', 'chitra a/p nadarajah'],
+ ['nik azlina binti nik din', 'norhayati binti mustafa'],
+ ['soh min pek', 'pek soh min'],
+ ['nurul hazimah binti januiddi', 'salmah binti alias'],
+ ['afiq muiz bin azman shah', 'elyana binti emrizal'],
  ]
  scores = model.predict(pairs)
  print(scores.shape)
@@ -62,13 +63,13 @@ print(scores.shape)

  # Or rank different texts based on similarity to a single text
  ranks = model.rank(
- 'ruben s/o veerasamy',
+ 'chitra nadarajah',
  [
- 'ruben a/p veerasamy',
- 'krishnamurthy s/o aravind',
- 'siti rohani binti kassim',
- 'liang zhen yap',
- 'radhika a/p arun',
+ 'chitra a/p nadarajah',
+ 'norhayati binti mustafa',
+ 'pek soh min',
+ 'salmah binti alias',
+ 'elyana binti emrizal',
  ]
  )
  # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
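As the trailing comment indicates, `rank()` returns one `{'corpus_id', 'score'}` dict per candidate, sorted by descending score, so the best match for a query name can be read off directly. A minimal sketch, reusing the candidates above:

```python
# Pick the best-matching candidate from rank() output; assumes the
# sorted [{'corpus_id': ..., 'score': ...}, ...] format shown above.
from sentence_transformers import CrossEncoder

model = CrossEncoder("foochun/bge-reranker-ft-v2")
candidates = ['chitra a/p nadarajah', 'norhayati binti mustafa', 'pek soh min']
ranks = model.rank('chitra nadarajah', candidates)
best = ranks[0]  # first entry has the highest score
print(candidates[best['corpus_id']], best['score'])
```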
@@ -119,16 +120,16 @@ You can finetune this model on your own dataset.
  * Size: 30,415 training samples
  * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
- |         | sentence_0 | sentence_1 | label |
- |:--------|:-----------|:-----------|:------|
- | type    | string     | string     | float |
- | details | <ul><li>min: 10 characters</li><li>mean: 20.81 characters</li><li>max: 45 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 19.08 characters</li><li>max: 46 characters</li></ul> | <ul><li>min: 0.55</li><li>mean: 0.74</li><li>max: 1.0</li></ul> |
+ |         | sentence_0 | sentence_1 | label |
+ |:--------|:-----------|:-----------|:------|
+ | type    | string     | string     | float |
+ | details | <ul><li>min: 9 characters</li><li>mean: 21.12 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 19.14 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 0.55</li><li>mean: 0.73</li><li>max: 1.0</li></ul> |
  * Samples:
- | sentence_0 | sentence_1 | label |
- |:-----------|:-----------|:------|
- | <code>ruben s/o veerasamy</code> | <code>ruben a/p veerasamy</code> | <code>0.623</code> |
- | <code>aravind a/l krishnamurthy</code> | <code>krishnamurthy s/o aravind</code> | <code>0.55</code> |
- | <code>nazmi bin ishak</code> | <code>siti rohani binti kassim</code> | <code>0.55</code> |
+ | sentence_0 | sentence_1 | label |
+ |:-----------|:-----------|:------|
+ | <code>chitra nadarajah</code> | <code>chitra a/p nadarajah</code> | <code>0.9</code> |
+ | <code>nik azlina binti nik din</code> | <code>norhayati binti mustafa</code> | <code>0.55</code> |
+ | <code>soh min pek</code> | <code>pek soh min</code> | <code>0.9</code> |
  * Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
  ```json
  {
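The `BinaryCrossEntropyLoss` named above treats the cross encoder's single output logit as an unnormalized match probability and scores it against the float label with binary cross-entropy. A minimal sketch of the per-pair computation, with hypothetical logit and label values:

```python
# Binary cross-entropy between the model's raw logit (sigmoid applied
# internally) and the soft similarity label. Values are hypothetical.
import torch
import torch.nn.functional as F

logit = torch.tensor([1.5])  # single output logit for one name pair
label = torch.tensor([0.9])  # float label like those in the samples table
loss = F.binary_cross_entropy_with_logits(logit, label)
print(loss.item())
```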
@@ -143,7 +144,6 @@ You can finetune this model on your own dataset.
  - `per_device_train_batch_size`: 64
  - `per_device_eval_batch_size`: 64
  - `num_train_epochs`: 5
- - `fp16`: True

  #### All Hyperparameters
  <details><summary>Click to expand</summary>
@@ -187,7 +187,7 @@ You can finetune this model on your own dataset.
  - `jit_mode_eval`: False
  - `use_ipex`: False
  - `bf16`: False
- - `fp16`: True
+ - `fp16`: False
  - `fp16_opt_level`: O1
  - `half_precision_backend`: auto
  - `bf16_full_eval`: False
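For context, the hyperparameters listed above correspond to a sentence-transformers cross-encoder training run. A minimal sketch of such a run, assuming the `CrossEncoderTrainer` / `CrossEncoderTrainingArguments` API of sentence-transformers v4+ and a hypothetical two-row stand-in for the real 30,415-pair dataset:

```python
from datasets import Dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer, CrossEncoderTrainingArguments
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Hypothetical stand-in for the 30,415-pair training set
# (columns match the model card: sentence_0, sentence_1, label).
train_dataset = Dataset.from_dict({
    "sentence_0": ["chitra nadarajah", "soh min pek"],
    "sentence_1": ["chitra a/p nadarajah", "pek soh min"],
    "label": [0.9, 0.9],
})

model = CrossEncoder("cross-encoder/stsb-roberta-base")
args = CrossEncoderTrainingArguments(
    output_dir="bge-reranker-ft-v2",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
)
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=BinaryCrossEntropyLoss(model),
)
trainer.train()
```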
@@ -271,19 +271,19 @@ You can finetune this model on your own dataset.
  ### Training Logs
  | Epoch | Step | Training Loss |
  |:------:|:----:|:-------------:|
- | 1.0504 | 500 | 0.4667 |
- | 2.1008 | 1000 | 0.4669 |
- | 3.1513 | 1500 | 0.4658 |
- | 4.2017 | 2000 | 0.4664 |
+ | 1.0504 | 500 | 0.4846 |
+ | 2.1008 | 1000 | 0.4684 |
+ | 3.1513 | 1500 | 0.4673 |
+ | 4.2017 | 2000 | 0.4668 |


  ### Framework Versions
- - Python: 3.11.9
+ - Python: 3.10.18
  - Sentence Transformers: 5.0.0
- - Transformers: 4.53.0
+ - Transformers: 4.53.2
  - PyTorch: 2.6.0+cu124
- - Accelerate: 1.8.1
+ - Accelerate: 1.9.0
- - Datasets: 3.6.0
+ - Datasets: 4.0.0
  - Tokenizers: 0.21.2

  ## Citation
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:59a5c1e90b355682d5319edde509f646c127eea8493dcc7ff63bc13b0dc648d5
+ oid sha256:a1723dd01ed39501db798471c5f74ef3924865dd965ffb858e32fbca6b5edc6a
  size 498609748
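A downloaded `model.safetensors` can be verified against this Git LFS pointer by hashing the file and comparing with the new `oid`. A minimal sketch:

```python
# Check a local model.safetensors against the sha256 oid in the pointer above.
import hashlib

h = hashlib.sha256()
with open("model.safetensors", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        h.update(chunk)
print(h.hexdigest() == "a1723dd01ed39501db798471c5f74ef3924865dd965ffb858e32fbca6b5edc6a")
```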