finetuned with additional names
- README.md +33 -33
- model.safetensors +1 -1
README.md CHANGED

@@ -6,19 +6,20 @@ tags:
 - generated_from_trainer
 - dataset_size:30415
 - loss:BinaryCrossEntropyLoss
+base_model: cross-encoder/stsb-roberta-base
 pipeline_tag: text-ranking
 library_name: sentence-transformers
 ---

-# CrossEncoder
+# CrossEncoder based on cross-encoder/stsb-roberta-base

-This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model …
+This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [cross-encoder/stsb-roberta-base](https://huggingface.co/cross-encoder/stsb-roberta-base) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.

 ## Model Details

 ### Model Description
 - **Model Type:** Cross Encoder
-…
+- **Base model:** [cross-encoder/stsb-roberta-base](https://huggingface.co/cross-encoder/stsb-roberta-base) <!-- at revision d576534b67143e2c70ee9966d7fdbf5835728d13 -->
 - **Maximum Sequence Length:** 512 tokens
 - **Number of Output Labels:** 1 label
 <!-- - **Training Dataset:** Unknown -->
@@ -50,11 +51,11 @@ from sentence_transformers import CrossEncoder
 model = CrossEncoder("foochun/bge-reranker-ft-v2")
 # Get scores for pairs of texts
 pairs = [
-    ['…
-    ['…
-    ['…
-    ['…
-    ['…
+    ['chitra nadarajah', 'chitra a/p nadarajah'],
+    ['nik azlina binti nik din', 'norhayati binti mustafa'],
+    ['soh min pek', 'pek soh min'],
+    ['nurul hazimah binti januiddi', 'salmah binti alias'],
+    ['afiq muiz bin azman shah', 'elyana binti emrizal'],
 ]
 scores = model.predict(pairs)
 print(scores.shape)
@@ -62,13 +63,13 @@ print(scores.shape)

 # Or rank different texts based on similarity to a single text
 ranks = model.rank(
-    '…
+    'chitra nadarajah',
     [
-        '…
-        '…
-        '…
-        '…
-        '…
+        'chitra a/p nadarajah',
+        'norhayati binti mustafa',
+        'pek soh min',
+        'salmah binti alias',
+        'elyana binti emrizal',
     ]
 )
 # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
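The updated example ranks Malaysian name variants against a query name. For readers of the diff, a minimal sketch of consuming the `rank()` output, mapping each `corpus_id` back to its candidate string (variable names here are illustrative, not from the card):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("foochun/bge-reranker-ft-v2")

query = "chitra nadarajah"
candidates = [
    "chitra a/p nadarajah",
    "norhayati binti mustafa",
    "pek soh min",
]

# rank() returns one dict per candidate, sorted by descending score;
# corpus_id indexes back into the candidates list.
for entry in model.rank(query, candidates):
    print(f"{entry['score']:.4f}  {candidates[entry['corpus_id']]}")
```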
@@ -119,16 +120,16 @@ You can finetune this model on your own dataset.
 * Size: 30,415 training samples
 * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
 * Approximate statistics based on the first 1000 samples:
-  | | sentence_0…
-  …
-  | type | string…
-  | details | <ul><li>min:…
+  | | sentence_0 | sentence_1 | label |
+  |:--------|:-----------|:-----------|:------|
+  | type | string | string | float |
+  | details | <ul><li>min: 9 characters</li><li>mean: 21.12 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 19.14 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 0.55</li><li>mean: 0.73</li><li>max: 1.0</li></ul> |
 * Samples:
-  | sentence_0…
-  …
-  | <code>…
-  | <code>…
-  | <code>…
+  | sentence_0 | sentence_1 | label |
+  |:--------------------------------------|:-------------------------------------|:------------------|
+  | <code>chitra nadarajah</code> | <code>chitra a/p nadarajah</code> | <code>0.9</code> |
+  | <code>nik azlina binti nik din</code> | <code>norhayati binti mustafa</code> | <code>0.55</code> |
+  | <code>soh min pek</code> | <code>pek soh min</code> | <code>0.9</code> |
 * Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
   ```json
   {
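For context on the loss named in this hunk: sentence-transformers' `BinaryCrossEntropyLoss` applies BCE-with-logits to the cross-encoder's single output logit, so graded float labels like the 0.55-1.0 values in the samples table act as soft targets. A minimal PyTorch illustration of that reduction, ignoring the loss's optional activation and pos-weight settings (the logit values are made up):

```python
import torch

# Hypothetical raw pair logits from a cross-encoder with one output label
logits = torch.tensor([2.3, -0.4, 1.7])
# Graded similarity labels taken from the samples table above
labels = torch.tensor([0.9, 0.55, 0.9])

# BinaryCrossEntropyLoss reduces to BCE-with-logits over these pairs
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```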
@@ -143,7 +144,6 @@ You can finetune this model on your own dataset.
 - `per_device_train_batch_size`: 64
 - `per_device_eval_batch_size`: 64
 - `num_train_epochs`: 5
-- `fp16`: True

 #### All Hyperparameters
 <details><summary>Click to expand</summary>
@@ -187,7 +187,7 @@ You can finetune this model on your own dataset.
 - `jit_mode_eval`: False
 - `use_ipex`: False
 - `bf16`: False
-- `fp16`: True
+- `fp16`: False
 - `fp16_opt_level`: O1
 - `half_precision_backend`: auto
 - `bf16_full_eval`: False
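Taken together, the two hyperparameter hunks show this commit retrained with fp16 disabled while keeping batch size 64 and 5 epochs. A hedged end-to-end sketch of how such a run is typically wired up with the sentence-transformers cross-encoder trainer; the dataset rows and output path are illustrative, since the card does not publish the full 30,415-pair training set:

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Tiny stand-in for the real (sentence_0, sentence_1, label) training set,
# using the three rows shown in the samples table
train_dataset = Dataset.from_dict({
    "sentence_0": ["chitra nadarajah", "nik azlina binti nik din", "soh min pek"],
    "sentence_1": ["chitra a/p nadarajah", "norhayati binti mustafa", "pek soh min"],
    "label": [0.9, 0.55, 0.9],
})

model = CrossEncoder("cross-encoder/stsb-roberta-base")  # the base model named above
loss = BinaryCrossEntropyLoss(model)

args = CrossEncoderTrainingArguments(
    output_dir="bge-reranker-ft-v2",   # illustrative path
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    fp16=False,                        # this commit flips fp16 from True to False
)

CrossEncoderTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss).train()
```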
@@ -271,19 +271,19 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch | Step | Training Loss |
 |:------:|:----:|:-------------:|
-| 1.0504 | 500 | 0.…
-| 2.1008 | 1000 | 0.…
-| 3.1513 | 1500 | 0.…
-| 4.2017 | 2000 | 0.…
+| 1.0504 | 500 | 0.4846 |
+| 2.1008 | 1000 | 0.4684 |
+| 3.1513 | 1500 | 0.4673 |
+| 4.2017 | 2000 | 0.4668 |


 ### Framework Versions
-- Python: 3.…
+- Python: 3.10.18
 - Sentence Transformers: 5.0.0
-- Transformers: 4.53.…
+- Transformers: 4.53.2
 - PyTorch: 2.6.0+cu124
-- Accelerate: 1.…
-- Datasets: …
+- Accelerate: 1.9.0
+- Datasets: 4.0.0
 - Tokenizers: 0.21.2

 ## Citation
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:…
+oid sha256:a1723dd01ed39501db798471c5f74ef3924865dd965ffb858e32fbca6b5edc6a
 size 498609748
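The model.safetensors change is only the Git LFS pointer update: new weights, same 498,609,748-byte size. A quick sketch for verifying a downloaded copy against the new pointer (the local path is hypothetical):

```python
import hashlib
from pathlib import Path

# Hypothetical local copy of the updated weights file
path = Path("model.safetensors")

print(path.stat().st_size == 498609748)  # size from the LFS pointer

# Hash in chunks to avoid loading ~500 MB at once
sha = hashlib.sha256()
with path.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)
print(sha.hexdigest() == "a1723dd01ed39501db798471c5f74ef3924865dd965ffb858e32fbca6b5edc6a")
```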