
Books Named Entity Recognition Model

This model specialises in recognising book titles and author names in short, user‑typed queries. It achieves > 92 % F1 on a held‑out evaluation set.

See the model in action in Gutenberg AI Search by Empathy.ai.


1  Provenance

This model was fine-tuned from gliner-community/gliner_large-v2.5 on synthetic query data derived from Project Gutenberg's public catalogue.


2  Use‑case

This model is a drop‑in replacement for generic GLiNER when your text stream revolves around bibliographic requests such as:

  • “Looking for Dune from Frank Herbert.”
  • “Any recommendations by Mary Shelley?”

Typical applications:

  • Query understanding in library / e‑book search engines
  • Post‑processing LLM output to structure reading lists
  • Digital humanities pipelines that need lightweight title/author extraction

Not suitable for: recognising publishers, ISBNs, or long bibliography-style references (only short queries were used for training).
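For query understanding, the model's entity spans can be grouped into a structured search filter. The helper below is an illustrative sketch only; the function name and filter shape are not part of the model's API.

```python
# Hypothetical post-processing sketch: turn GLiNER entity spans into a
# structured filter for a book-search backend.

def entities_to_query(entities):
    """Group predicted spans by label into a simple filter dict."""
    query = {"title": [], "author": []}
    for ent in entities:
        label = ent.get("label")
        if label in query:
            query[label].append(ent["text"])
    return query

# Example input in the shape model.predict_entities() returns:
predictions = [
    {"text": "Dune", "label": "title", "score": 0.98},
    {"text": "Frank Herbert", "label": "author", "score": 0.97},
]
print(entities_to_query(predictions))
# {'title': ['Dune'], 'author': ['Frank Herbert']}
```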


3  Performance

| Metric    | Overall | title  | author |
|-----------|---------|--------|--------|
| Precision | 0.9999  | 0.9999 | 0.9999 |
| Recall    | 0.8583  | 0.7661 | 0.9287 |
| F1‑score  | 0.9237  | 0.8675 | 0.9630 |
| Support   | 69 880  | 30 290 | 39 590 |

Evaluation dataset: 43 493 queries (English‑only) held out from the training corpus. Prediction threshold = 0.2.
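As a sanity check, each reported F1 score is the harmonic mean of the precision and recall figures above:

```python
# F1 = 2PR / (P + R), reproduced from the table's precision/recall columns.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f"{f1(0.9999, 0.8583):.4f}")  # overall -> 0.9237
print(f"{f1(0.9999, 0.7661):.4f}")  # title   -> 0.8675
print(f"{f1(0.9999, 0.9287):.4f}")  # author  -> 0.9630
```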


4  Quick start

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("empathyai/gliner_large-v2.5-books")
text = "Looking for The Man in the High Castle by Philip K. Dick."

entities = model.predict_entities(text, ["title", "author"], threshold=0.2)
print(entities)
# [{'text': 'The Man in the High Castle', 'label': 'title', 'score': 0.99},
#  {'text': 'Philip K. Dick', 'label': 'author', 'score': 0.99}]
```
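The threshold trades recall for precision: spans scoring below it are dropped. Predictions can be re-filtered at a stricter threshold without re-running the model, as in this sketch (the borderline span and its score are made up for illustration):

```python
# Re-filter already-predicted entities at a stricter score threshold.

def filter_by_score(entities, threshold):
    return [e for e in entities if e["score"] >= threshold]

predictions = [
    {"text": "The Man in the High Castle", "label": "title", "score": 0.99},
    {"text": "Philip K. Dick", "label": "author", "score": 0.99},
    {"text": "Looking", "label": "title", "score": 0.21},  # hypothetical borderline span
]
print(len(filter_by_score(predictions, 0.2)))  # 3 (borderline span kept)
print(len(filter_by_score(predictions, 0.5)))  # 2 (borderline span dropped)
```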

5  Training details

  • Base model: gliner_large-v2.5 (≈ 459 M parameters, Creative Commons Zero v1.0 Universal)
  • Dataset: empathyai/books-ner-dataset — 435 k synthetic English queries (titles + authors only)
  • Splits: 391 432 train / 43 493 eval (duplicates removed)
  • Script highlights
    • Learning rate 5 × 10⁻⁶, linear schedule, warm‑up 10 %
    • Batch 32, gradient accumulation 2, focal loss α 0.75 / γ 2
    • 1 epoch
    • Gradient checkpointing + BF16 for memory efficiency
    • Trained on a single L40S; total wall time ≈ 40 min
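The focal loss named above down-weights well-classified examples so training focuses on hard spans. A minimal binary focal-loss sketch with the card's α = 0.75 / γ = 2, in plain Python; GLiNER's actual implementation differs in detail:

```python
import math

def focal_loss(p, target, alpha=0.75, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)
    for a single predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction is penalised far less than a hard one:
easy = focal_loss(0.95, 1)  # well-classified positive, near zero
hard = focal_loss(0.30, 1)  # misclassified positive, much larger
print(easy < hard)  # True
```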

6  Limitations & bias

  • The vocabulary of titles/authors comes from Project Gutenberg (public‑domain heavy; modern best‑sellers may be unseen).
  • Only short, informal English queries were simulated. Long paragraphs or non‑English text may degrade accuracy.
  • Does not tag publishers, dates, ISBNs, or other bibliographic fields.

7  Acknowledgements

Thanks to the GLiNER authors and maintainers; HuggingFace for hosting; Project Gutenberg volunteers for the free metadata.
