knowledgator
/

gliclass_msmarco_merged

	---
	datasets:
	- sentence-transformers/msmarco
	---
	# ⭐ [GLiClass](https://github.com/Knowledgator/GLiClass): Generalist and Lightweight Model for Sequence Classification

	This is an efficient zero-shot classifier inspired by the [GLiNER](https://github.com/urchade/GLiNER/tree/main) work. It achieves the same performance as a cross-encoder while being more compute-efficient, because classification is done in a single forward pass.
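To make the efficiency argument concrete, here is a minimal sketch that only counts encoder forward passes. It assumes a typical cross-encoder setup that scores each (text, label) pair separately, and it ignores batching and the longer input that results from packing all labels into one sequence. The function names are hypothetical and not part of any library.

```python
def cross_encoder_passes(n_texts: int, n_labels: int) -> int:
    """Assumed cross-encoder cost: one forward pass per (text, label) pair."""
    return n_texts * n_labels


def single_pass_passes(n_texts: int, n_labels: int) -> int:
    """Single-pass cost: all labels for a text are scored together."""
    return n_texts


# One query scored against eight candidate labels.
print(cross_encoder_passes(1, 8))  # 8
print(single_pass_passes(1, 8))    # 1
```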

	It can be used for `topic classification`, `sentiment analysis` and as a reranker in `RAG` pipelines.

	The model was trained on synthetic and licensed data that allows commercial use, so it can be used in commercial applications.

	This version of the model uses a layer-wise selection of features that enables a better understanding of different levels of language. The backbone model is [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), which effectively processes long sequences.

	### How to use:
	First, install the GLiClass library and a recent version of `transformers`:
	```bash
	pip install gliclass
	pip install -U "transformers>=4.48.0"
	```

	Then initialize a model and a pipeline:

	```python
	import torch
	from gliclass import GLiClassModel, ZeroShotClassificationPipeline
	from transformers import AutoTokenizer

	# Use the first GPU when available, otherwise fall back to CPU.
	device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

	model = GLiClassModel.from_pretrained("alexandrlukashov/gliclass_msmarco_merged")
	tokenizer = AutoTokenizer.from_pretrained("alexandrlukashov/gliclass_msmarco_merged", add_prefix_space=True)
	pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)

	text = "I want to live in New York."
	# Candidate passages to score against the text.
	labels = [
	    'York is a cathedral city in North Yorkshire, England, with Roman origins',
	    'San Francisco,[23] officially the City and County of San Francisco, is a commercial, financial, and cultural center within Northern California, United States.',
	    'New York, often called New York City (NYC),[b] is the most populous city in the United States',
	    "New York City is the third album by electronica group Brazilian Girls, released in 2008.",
	    "New York City was an American R&B vocal group.",
	    "New York City is an album by the Peter Malick Group featuring Norah Jones.",
	    "New York City: The Album is the debut studio album by American rapper Troy Ave. ",
	    '"New York City" is a song by British new wave band The Armoury Show',
	]
	# The pipeline returns one result list per input text; take the first because we passed a single text.
	results = pipeline(text, labels, threshold=0.5)[0]
	for result in results:
	    print(result["label"], "=>", result["score"])
	```
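For reranking in a RAG pipeline, the scored labels can be sorted and truncated to the best candidates. The sketch below assumes only the output shape used in the example above (a list of dicts with `"label"` and `"score"` keys). The `rerank` helper and the sample scores are hypothetical and stand in for real pipeline output.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def rerank(results: List[Result], top_k: int = 3) -> List[str]:
    """Return up to top_k labels, ordered by descending score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["label"] for r in ranked[:top_k]]


# Hypothetical scores, not produced by the model.
example_results = [
    {"label": "passage A", "score": 0.12},
    {"label": "passage B", "score": 0.91},
    {"label": "passage C", "score": 0.55},
]
print(rerank(example_results, top_k=2))  # ['passage B', 'passage C']
```

If the pipeline drops labels that score below `threshold`, a lower threshold is likely needed so that every candidate receives a rank; this is an assumption about the pipeline's filtering that is worth verifying against the GLiClass documentation.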

	### Benchmarking:
	| Dataset | Base NDCG@10 | GLiClass NDCG@10 |
	|---------|-------------|------------------|
	| NanoArguAna | 0.489 | 0.525 |
	| NanoClimateFEVER | 0.318 | 0.870 |
	| NanoDBPedia | 0.614 | 0.871 |
	| NanoFEVER | 0.809 | 0.770 |
	| NanoFiQA2018 | 0.437 | 0.719 |
	| NanoHotpotQA | 0.828 | 0.647 |
	| NanoMSMARCO | 0.540 | 0.445 |
	| NanoNFCorpus | 0.325 | 0.710 |
	| NanoNQ | 0.501 | 0.588 |
	| NanoQuoraRetrieval | 0.869 | 0.540 |
	| NanoSCIDOCS | 0.335 | 0.917 |
	| NanoSciFact | 0.710 | 0.652 |
	| NanoTouche2020 | 0.694 | 0.490 |
	| NanoBEIR (mean) | 0.574 | 0.673 |
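
For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k using the common linear-gain formulation. It is not the evaluation code behind the table. It also assumes the input list holds the relevance of every candidate in ranked order, so the ideal ordering is derived from that same list; a full evaluation would normally build the ideal ranking from all relevant documents for the query.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# The only relevant passage is ranked second instead of first.
print(round(ndcg_at_k([0, 1, 0]), 3))  # 0.631
```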