---
datasets:
- sentence-transformers/msmarco
---
# ⭐ [GLiClass](https://github.com/Knowledgator/GLiClass): Generalist and Lightweight Model for Sequence Classification
This is an efficient zero-shot classifier inspired by the [GLiNER](https://github.com/urchade/GLiNER/tree/main) work. It demonstrates the same performance as a cross-encoder while being more compute-efficient, because all labels are classified in a single forward pass.
It can be used for `topic classification`, `sentiment analysis` and as a reranker in `RAG` pipelines.
The model was trained on synthetic and licensed data that permit commercial use, so it can be used in commercial applications.
This version of the model uses a layer-wise selection of features that enables a better understanding of different levels of language. The backbone model is [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), which effectively processes long sequences.
### How to use:
First, install the GLiClass library:
```bash
pip install gliclass
pip install -U "transformers>=4.48.0"
```
Then initialize a model and a pipeline:
```python
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub
model = GLiClassModel.from_pretrained("alexandrlukashov/gliclass_msmarco_merged")
tokenizer = AutoTokenizer.from_pretrained("alexandrlukashov/gliclass_msmarco_merged", add_prefix_space=True)

# Build a multi-label zero-shot pipeline on the first GPU
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "I want to live in New York."
labels = [
    'York is a cathedral city in North Yorkshire, England, with Roman origins',
    'San Francisco,[23] officially the City and County of San Francisco, is a commercial, financial, and cultural center within Northern California, United States.',
    'New York, often called New York City (NYC),[b] is the most populous city in the United States',
    "New York City is the third album by electronica group Brazilian Girls, released in 2008.",
    "New York City was an American R&B vocal group.",
    "New York City is an album by the Peter Malick Group featuring Norah Jones.",
    "New York City: The Album is the debut studio album by American rapper Troy Ave. ",
    '"New York City" is a song by British new wave band The Armoury Show',
]

# The pipeline returns one list of results per input text; we passed a single text
results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
    print(result["label"], "=>", result["score"])
```
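When the model is used as a reranker, the scored passages still need to be ordered. The sketch below is one possible way to do that, assuming each result is a dict with `label` and `score` keys as in the loop above. The `rerank` helper and the example scores are hypothetical and not part of the GLiClass library.

```python
from typing import Dict, List


def rerank(results: List[Dict], top_k: int = 3) -> List[Dict]:
    """Sort scored passages by descending score and keep the best top_k.

    Hypothetical helper: it only assumes each item has a "score" key.
    """
    return sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]


# Made-up scores, shaped like one entry of the pipeline output
example_results = [
    {"label": "passage A", "score": 0.12},
    {"label": "passage B", "score": 0.91},
    {"label": "passage C", "score": 0.57},
]

for r in rerank(example_results, top_k=2):
    print(r["label"], "=>", r["score"])
```

Since the pipeline call above passes `threshold=0.5`, a lower threshold is presumably needed if low-scoring passages should remain in the ranking.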
### Benchmarking:
| Dataset | Base NDCG@10 | GLiClass NDCG@10 |
|---------|-------------|------------------|
| NanoArguAna | 0.489 | 0.525 |
| NanoClimateFEVER | 0.318 | 0.870 |
| NanoDBPedia | 0.614 | 0.871 |
| NanoFEVER | 0.809 | 0.770 |
| NanoFiQA2018 | 0.437 | 0.719 |
| NanoHotpotQA | 0.828 | 0.647 |
| NanoMSMARCO | 0.540 | 0.445 |
| NanoNFCorpus | 0.325 | 0.710 |
| NanoNQ | 0.501 | 0.588 |
| NanoQuoraRetrieval | 0.869 | 0.540 |
| NanoSCIDOCS | 0.335 | 0.917 |
| NanoSciFact | 0.710 | 0.652 |
| NanoTouche2020 | 0.694 | 0.490 |
| **NanoBEIR (mean)** | 0.574 | **0.673** |
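
For readers unfamiliar with the metric, NDCG@10 rewards rankings that place relevant documents near the top of the first ten positions. The following is a minimal sketch of the linear-gain variant, not the evaluation code used to produce the table. It assumes `relevances` holds the graded relevance of every candidate in ranked order, so the ideal ordering can be derived from the same list; the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# A perfect ranking scores 1.0; pushing the only relevant item
# down to the second position lowers the score.
print(ndcg_at_k([1, 0, 0]))
print(ndcg_at_k([0, 1, 0]))
```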