---
datasets:
- sentence-transformers/msmarco
---
# ⭐ [GLiClass](https://github.com/Knowledgator/GLiClass): Generalist and Lightweight Model for Sequence Classification

This is an efficient zero-shot classifier inspired by the [GLiNER](https://github.com/urchade/GLiNER/tree/main) work. It matches the performance of a cross-encoder while being more compute-efficient, because classification is done in a single forward pass.

It can be used for `topic classification`, `sentiment analysis` and as a reranker in `RAG` pipelines.

The model was trained on synthetic and licensed data that permit commercial use, so it can be used in commercial applications.

This version of the model uses a layer-wise selection of features that enables a better understanding of different levels of language. The backbone model is [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), which effectively processes long sequences.

### How to use:
First, install the GLiClass library:
```bash
pip install gliclass
pip install -U "transformers>=4.48.0"
```

Then initialize a model and a pipeline:

```python
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("alexandrlukashov/gliclass_msmarco_merged")
tokenizer = AutoTokenizer.from_pretrained("alexandrlukashov/gliclass_msmarco_merged", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "I want to live in New York."
labels = [
    'York is a cathedral city in North Yorkshire, England, with Roman origins',
    'San Francisco,[23] officially the City and County of San Francisco, is a commercial, financial, and cultural center within Northern California, United States.',
    'New York, often called New York City (NYC),[b] is the most populous city in the United States',
    "New York City is the third album by electronica group Brazilian Girls, released in 2008.",
    "New York City was an American R&B vocal group.",
    "New York City is an album by the Peter Malick Group featuring Norah Jones.",
    "New York City: The Album is the debut studio album by American rapper Troy Ave. ",
    '"New York City" is a song by British new wave band The Armoury Show',
]
results = pipeline(text, labels, threshold=0.5)[0]  # index 0: results for the single input text
for result in results:
    print(result["label"], "=>", result["score"])
```
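
For reranking in a `RAG` pipeline, the scored labels can be sorted so that the highest-scoring passages come first. The sketch below assumes only the output shape used above (a list of dicts with `label` and `score` keys). The `rerank` helper and the example scores are hypothetical and are not part of the GLiClass library or real model output.

```python
from typing import Dict, List, Union

ScoredLabel = Dict[str, Union[str, float]]


def rerank(results: List[ScoredLabel], top_k: int = 3) -> List[str]:
    """Return the top_k labels ordered by descending score (hypothetical helper)."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["label"] for r in ranked[:top_k]]


# Made-up scores shaped like the pipeline output; not real model predictions.
example_results = [
    {"label": "passage about York, England", "score": 0.12},
    {"label": "passage about New York City", "score": 0.91},
    {"label": "passage about a music album", "score": 0.33},
]

for label in rerank(example_results, top_k=2):
    print(label)
```

Results that fall below `threshold` may not appear in the pipeline output at all, so a lower threshold is probably preferable when every candidate passage needs a rank.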

### Benchmarking:
| Dataset | Base NDCG@10 | GLiClass NDCG@10 |
|---------|-------------|------------------|
| NanoArguAna | 0.489 | 0.525 |
| NanoClimateFEVER | 0.318 | 0.870 |
| NanoDBPedia | 0.614 | 0.871 |
| NanoFEVER | 0.809 | 0.770 |
| NanoFiQA2018 | 0.437 | 0.719 |
| NanoHotpotQA | 0.828 | 0.647 |
| NanoMSMARCO | 0.540 | 0.445 |
| NanoNFCorpus | 0.325 | 0.710 |
| NanoNQ | 0.501 | 0.588 |
| NanoQuoraRetrieval | 0.869 | 0.540 |
| NanoSCIDOCS | 0.335 | 0.917 |
| NanoSciFact | 0.710 | 0.652 |
| NanoTouche2020 | 0.694 | 0.490 |
| **NanoBEIR (mean)** | 0.574 | **0.673** |
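
For readers unfamiliar with the metric, NDCG@10 can be sketched as follows. This is a minimal illustration of the standard formula with a log2 rank discount, not the evaluation script used to produce the table. It assumes the relevance list covers every judged document for the query, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each returned document, in the order the reranker produced them.
print(ndcg_at_k([1, 0, 0]))  # the relevant document is ranked first
print(ndcg_at_k([0, 1, 0]))  # the relevant document is ranked second
```

A score of 1.0 means the ranking matches the ideal ordering, and lower values indicate that relevant documents were placed further down the list.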