bge-small-en-v1.5-ultrafineweb-vs-pile-classifier
Note: This model is provided for reference and reproducibility, not for standalone use.
This model is a fine-tuned version of BAAI/bge-small-en-v1.5 that classifies text as high quality or low quality for AI training.
- Trained on 100k samples from openbmb/Ultra-FineWeb (high quality) and 100k from EleutherAI/the_pile_deduplicated (low quality)
- 80% training / 20% validation split
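On the validation set (loss, accuracy, and combined score match the epoch 1 row of the training results, which has the lowest validation loss):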
- Loss: 0.2926
- Accuracy: 0.9061
- Combined Score: 2.1448
- Tokens processed: 102,184,960 (cumulative input tokens seen over all 5 training epochs)
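As a rough sanity check, the token count can be related to the step count and batch size reported in the training sections of this card. The sketch below is plain arithmetic; the resulting per-example length of 128 tokens is an inference from these numbers, not something the card states.

```python
# Figures copied from this card.
total_tokens = 102_184_960   # "Tokens processed" / final "Input Tokens Seen"
total_steps = 99_790         # final step in the training results
train_batch_size = 8         # from the training hyperparameters

# Tokens consumed per optimizer step, with the remainder to confirm an exact fit.
tokens_per_step, remainder = divmod(total_tokens, total_steps)

# Inferred (not stated in the card): tokens per example within a batch.
tokens_per_example = tokens_per_step // train_batch_size

print(tokens_per_step, remainder)  # 1024 0
print(tokens_per_example)          # 128
```

If the inference holds, inputs were padded or truncated to 128 tokens during training, so longer documents may only have been judged on their opening tokens.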
Example
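from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="agentlans/bge-small-en-v1.5-ultrafineweb-vs-pile-classifier")
# Returns a list with one dict per input, each holding a "label" and a "score"
print(classifier("Your text here."))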
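A text-classification pipeline returns one `{"label": ..., "score": ...}` dict per input, so the predictions can be post-processed with ordinary Python. The following is a minimal sketch of that step. The helper `keep_high_quality`, the label strings `"HIGH"` and `"LOW"`, and the 0.9 threshold are all hypothetical; the card does not state the model's label names, which can be read from `classifier.model.config.id2label`.

```python
def keep_high_quality(texts, predictions, high_label, min_score=0.5):
    """Return the texts predicted as `high_label` with score >= min_score.

    `predictions` is expected to look like pipeline output:
    a list of dicts with "label" and "score" keys, one per text.
    """
    kept = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == high_label and pred["score"] >= min_score:
            kept.append(text)
    return kept


# Stand-in predictions so the sketch runs without downloading the model.
# The label strings here are placeholders, not the model's real labels.
texts = ["first document", "second document", "third document"]
predictions = [
    {"label": "HIGH", "score": 0.97},
    {"label": "LOW", "score": 0.88},
    {"label": "HIGH", "score": 0.55},
]

kept = keep_high_quality(texts, predictions, high_label="HIGH", min_score=0.9)
print(kept)  # ['first document']
```

With the real model, `predictions` would come from `classifier(texts)`, and `high_label` would be whichever entry of `id2label` corresponds to the Ultra-FineWeb class.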
Limitations
- Tends to be overly strict, labelling most texts that fall outside the training data as low quality
- English only
- May be biased against some text types such as source code and personal blogs
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 5.0
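To illustrate what the linear scheduler implies for these settings, here is a minimal sketch of the learning rate as a function of the training step. It assumes zero warmup steps, since the card lists none, and takes the total of 99,790 steps from the training results; `linear_lr` is an illustrative helper, not the code used for training.

```python
def linear_lr(step, base_lr=5e-5, total_steps=99_790, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero.

    warmup_steps=0 is an assumption; the card does not list a warmup setting.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start and at each epoch boundary from the results table.
for epoch, step in enumerate([0, 19_958, 39_916, 59_874, 79_832, 99_790]):
    print(epoch, f"{linear_lr(step):.2e}")
```

Under these assumptions the rate drops by one fifth of its initial value per epoch, from 5e-05 at the start to zero at the final step.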
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Combined Score | Input Tokens Seen |
|---|---|---|---|---|---|---|
| 0.2893 | 1.0 | 19958 | 0.2926 | 0.9061 | 2.1448 | 20436992 |
| 0.2397 | 2.0 | 39916 | 0.3127 | 0.9076 | 2.1194 | 40873984 |
| 0.2 | 3.0 | 59874 | 0.3279 | 0.9109 | 2.0605 | 61310976 |
| 0.1576 | 4.0 | 79832 | 0.3887 | 0.9080 | 2.1119 | 81747968 |
| 0.1127 | 5.0 | 99790 | 0.4688 | 0.9069 | 2.1308 | 102184960 |
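In these results, training loss falls every epoch while validation loss rises, a pattern usually read as overfitting after the first epoch. The sketch below copies the table values and picks the best epoch by each criterion; it only illustrates how the rows compare and does not claim to show how the published checkpoint was chosen.

```python
# (epoch, training_loss, validation_loss, accuracy), copied from the table.
rows = [
    (1, 0.2893, 0.2926, 0.9061),
    (2, 0.2397, 0.3127, 0.9076),
    (3, 0.2, 0.3279, 0.9109),
    (4, 0.1576, 0.3887, 0.9080),
    (5, 0.1127, 0.4688, 0.9069),
]

# Best epoch by lowest validation loss and by highest accuracy.
best_by_loss = min(rows, key=lambda r: r[2])
best_by_accuracy = max(rows, key=lambda r: r[3])
print("lowest validation loss: epoch", best_by_loss[0])   # epoch 1
print("highest accuracy: epoch", best_by_accuracy[0])     # epoch 3

# Gap between validation and training loss; a widening gap suggests overfitting.
gaps = [val_loss - train_loss for _, train_loss, val_loss, _ in rows]
for (epoch, *_), gap in zip(rows, gaps):
    print(epoch, round(gap, 4))
```

The two criteria disagree here: validation loss is lowest at epoch 1, while accuracy peaks at epoch 3, with all five accuracies within about half a percentage point of each other.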
Framework versions
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0