Generated from Research Library:
https://github.com/peytontolbert/Research_Library
A3 Abstract Keywords
A3 is a LoRA adapter for google/flan-t5-base trained to generate concise keyword lists from scientific paper abstracts inside the Research Library project.
This repository contains the PEFT adapter only. Load it on top of google/flan-t5-base.
Intended Use
- Generate paper keywords from abstracts.
- Tag and index scientific papers for search, browsing, and clustering.
- Provide lightweight metadata for paper library interfaces.
This model is not intended to summarize full papers or generate scientific claims.
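As a rough illustration of the tagging and indexing use case, the sketch below builds a simple inverted index from keywords to paper IDs. The function name `build_keyword_index`, the paper IDs, and the keywords for `paper-3` are hypothetical and not part of the Research Library codebase; the other two keyword lists are taken from the example outputs in this card.

```python
from __future__ import annotations

from collections import defaultdict


def build_keyword_index(paper_keywords: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each lowercased keyword to the IDs of the papers tagged with it."""
    index: defaultdict[str, list[str]] = defaultdict(list)
    for paper_id, keywords in paper_keywords.items():
        for keyword in keywords:
            key = keyword.strip().lower()
            # Skip empty terms and avoid listing a paper twice under one keyword.
            if key and paper_id not in index[key]:
                index[key].append(paper_id)
    return dict(index)


# Hypothetical paper IDs; "paper-3" and its keywords are invented for illustration.
papers = {
    "paper-1": ["citation", "neural", "network", "abstracts"],
    "paper-2": ["protein", "structure", "diffusion", "transformer"],
    "paper-3": ["neural", "diffusion"],
}

index = build_keyword_index(papers)
print(index["neural"])     # ['paper-1', 'paper-3']
print(index["diffusion"])  # ['paper-2', 'paper-3']
```

A lookup table like this is only one possible way to consume the keywords; a search backend or a clustering step could use the same lists directly.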
Training Data
The adapter was trained on PeytonT/1m_papers_text, a dataset of full text and metadata for 1M papers. For A3, the training input was the paper abstract, and the target was a normalized keyword string derived from the available paper metadata and text fields.
Training Procedure
- Base model: google/flan-t5-base
- Adapter: LoRA
- Task type: SEQ_2_SEQ_LM
- LoRA rank: 8
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q, v
- Objective: cross-entropy sequence generation
- Batch size: 256
- Max source tokens: 768
- Max target tokens: 128
- Precision: bf16
- Optimizer: adamw
- Learning rate: 1e-4
- Warmup steps: 1000
- Training steps: 3895
- Epochs: 1.0
- Last logged train loss: 0.3175
- Hardware: single H100 80GB GPU
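The figures above can be cross-checked with a little arithmetic. The sketch below derives the number of examples seen in training, the LoRA scaling factor, and an estimate of the trainable adapter parameters. The batch size, step count, rank, and alpha come from this card; the architecture values (hidden size 768, 12 encoder and 12 decoder layers) are assumptions about google/flan-t5-base that this card does not state, so the parameter count is an estimate.

```python
# Values listed in this card.
batch_size = 256
training_steps = 3895
lora_rank = 8
lora_alpha = 32

# Examples processed in one pass: consistent with ~1 epoch over ~1M papers.
examples_seen = batch_size * training_steps
print(examples_seen)  # 997120

# LoRA scales the low-rank update by alpha / rank.
lora_scale = lora_alpha / lora_rank
print(lora_scale)  # 4.0

# ASSUMED flan-t5-base architecture (not stated in this card).
d_model = 768
encoder_layers = 12
decoder_layers = 12

# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention). LoRA targets q and v in every block.
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_modules = attention_blocks * 2

# Each adapted projection adds an A matrix (rank x d_model) and a
# B matrix (d_model x rank).
params_per_module = lora_rank * (d_model + d_model)
trainable_params = adapted_modules * params_per_module
print(trainable_params)  # 884736
```

Under these assumptions the adapter holds fewer than one million trainable weights, which is why the repository ships only the adapter and relies on the base model being loaded separately.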
Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

repo_id = "PeytonT/1m-papers-abstract-keywords"
base_id = "google/flan-t5-base"

# Load the tokenizer from the adapter repo and the base model from its own repo,
# then attach the LoRA adapter on top of the base model.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, repo_id)
model.eval()

abstract = (
    "We introduce a graph neural network retrieval system for scientific papers "
    "that jointly embeds citation links, paper abstracts, and author metadata."
)

# Truncate to the 768-token source limit used in training.
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=768)
# Drop token_type_ids if the tokenizer returned them.
inputs.pop("token_type_ids", None)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

keywords = tokenizer.decode(output[0], skip_special_tokens=True)
print(keywords)
Example Outputs
Raw generations may contain duplicate terms. A simple downstream post-process should split on commas, trim whitespace, remove stopwords, and de-duplicate while preserving order.
Input: graph neural retrieval system for scientific papers...
Output: citation, neural, network, abstracts
Input: diffusion transformer for protein structure generation...
Output: protein, structure, diffusion, transformer, equivariant, evaluations
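The post-processing described above could be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function name `clean_keywords` and the small `STOPWORDS` set are assumptions, and a real deployment would likely use a fuller stopword list.

```python
from __future__ import annotations

# Small illustrative stopword set (an assumption, not the project's list).
STOPWORDS = frozenset(
    {"that", "with", "from", "the", "and", "for", "of", "in", "on", "a", "an"}
)


def clean_keywords(raw: str, stopwords: frozenset[str] = STOPWORDS) -> list[str]:
    """Split on commas, trim, drop stopwords, and de-duplicate in order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for part in raw.split(","):
        term = part.strip()
        key = term.lower()  # compare case-insensitively
        if not term or key in stopwords or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)  # keep the first spelling encountered
    return cleaned


print(clean_keywords("citation, neural, network, that, citation, abstracts"))
# ['citation', 'neural', 'network', 'abstracts']
```

De-duplication here is case-insensitive and keeps the first occurrence, so the order produced by the model is preserved.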
Limitations
- The model can repeat keywords and may emit generic words such as "that", "with", or "from" without post-processing.
- It is optimized for abstract-level keyword generation, not full-paper tagging.
- External benchmark evaluation has not yet been run.
- Keyword quality depends on the quality and style of the source abstracts and metadata.
Project Context
This model is part of the Research Library system for exploring repositories and scientific papers through search, metadata views, paper graphs, and 3D universe visualizations.
Framework Versions
- PEFT 0.19.1