Generated from Research Library:

https://github.com/peytontolbert/Research_Library

A3 Abstract Keywords

A3 is a LoRA adapter for google/flan-t5-base, trained as part of the Research Library project to generate concise keyword lists from scientific paper abstracts.

This repository contains the PEFT adapter only. Load it on top of google/flan-t5-base.

Intended Use

Generate paper keywords from abstracts.
Tag and index scientific papers for search, browsing, and clustering.
Provide lightweight metadata for paper library interfaces.

This model is not intended to summarize full papers or generate scientific claims.

Training Data

The adapter was trained on PeytonT/1m_papers_text, a dataset of 1M papers with full text and metadata. For A3, the training input was the paper abstract and the target was a normalized keyword string derived from the available paper metadata and text fields.

Training Procedure

Base model: google/flan-t5-base
Adapter: LoRA
Task type: SEQ_2_SEQ_LM
LoRA rank: 8
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: q, v
Objective: cross-entropy sequence generation
Batch size: 256
Max source tokens: 768
Max target tokens: 128
Precision: bf16
Optimizer: adamw
Learning rate: 1e-4
Warmup steps: 1000
Training steps: 3895
Epochs: 1.0
Last logged train loss: 0.3175
Hardware: single H100 80GB GPU
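
As a rough sanity check on the figures above, the following sketch derives three quantities from the listed hyperparameters: the LoRA scaling factor, the approximate number of training examples seen, and the adapter's trainable parameter count. It assumes standard LoRA scaling (alpha divided by rank), that 256 is the effective batch size per optimizer step, and the usual flan-t5-base dimensions (768-wide attention projections, 12 encoder and 12 decoder layers). None of these assumptions is stated in the card itself.

```python
# Values copied from the Training Procedure list.
lora_rank = 8
lora_alpha = 32
batch_size = 256
training_steps = 3895

# Standard LoRA multiplies the low-rank update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

# Assumption: 256 is the effective batch size per optimizer step.
examples_seen = batch_size * training_steps

# Assumed flan-t5-base architecture (not stated in the card).
d_model = 768          # hidden size
inner_dim = 768        # attention projection width (12 heads x 64)
encoder_layers = 12    # one self-attention block each
decoder_layers = 12    # one self-attention and one cross-attention block each

attention_blocks = encoder_layers + 2 * decoder_layers
targeted_modules = attention_blocks * 2  # q and v in every attention block

# Each adapted module adds A (rank x d_model) and B (inner_dim x rank).
params_per_module = lora_rank * (d_model + inner_dim)
trainable_params = targeted_modules * params_per_module

print(lora_scaling)      # 4.0
print(examples_seen)     # 997120
print(trainable_params)  # 884736
```

Under these assumptions, roughly 997k examples were seen, which is consistent with the reported single epoch over a dataset of about 1M papers.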

Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

repo_id = "PeytonT/1m-papers-abstract-keywords"
base_id = "google/flan-t5-base"

# Load the tokenizer from the adapter repo, then attach the LoRA adapter to the base model.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, repo_id)
model.eval()

abstract = (
    "We introduce a graph neural network retrieval system for scientific papers "
    "that jointly embeds citation links, paper abstracts, and author metadata."
)
# Truncate to the same source length used in training (768 tokens).
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=768)
# T5 does not accept token_type_ids; drop them if the tokenizer returns any.
inputs.pop("token_type_ids", None)

# Greedy decoding; raise max_new_tokens (up to the 128-token training target) for longer lists.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

keywords = tokenizer.decode(output[0], skip_special_tokens=True)
print(keywords)

Example Outputs

Raw generations may contain duplicate or generic terms. A simple downstream post-processing step should split the output on commas, trim whitespace, remove stopwords, and de-duplicate while preserving order.

Input: graph neural retrieval system for scientific papers...
Output: citation, neural, network, abstracts

Input: diffusion transformer for protein structure generation...
Output: protein, structure, diffusion, transformer, equivariant, evaluations
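
The post-processing steps described above can be sketched as follows. This is a minimal illustration, not the project's actual cleanup code: the function name `clean_keywords` and the small `STOPWORDS` set are hypothetical, and the sketch drops only terms that are entirely a stopword (it does not strip stopwords from inside multi-word keywords). De-duplication is case-insensitive and keeps the first spelling seen.

```python
# Hypothetical stopword list for illustration; replace with one suited to your corpus.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "for", "in", "on", "to", "that", "with", "from"}
)


def clean_keywords(raw, stopwords=STOPWORDS):
    """Split on commas, trim whitespace, drop stopwords, de-duplicate in order."""
    seen = set()
    cleaned = []
    for part in raw.split(","):
        term = part.strip()
        key = term.lower()
        # Skip empty fragments, stopword-only terms, and repeats.
        if not term or key in stopwords or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    return cleaned


print(clean_keywords("citation, neural, network, citation, with, abstracts"))
# ['citation', 'neural', 'network', 'abstracts']
```

The function takes the decoded model output string and returns a list, which can be joined back with ", " if a single string is needed for indexing.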

Limitations

The model can repeat keywords and, unless its output is post-processed, may emit generic words such as "that", "with", or "from".
It is optimized for abstract-level keyword generation, not full-paper tagging.
External benchmark evaluation has not yet been run.
Keyword quality depends on the quality and style of the source abstracts and metadata.

Project Context

This model is part of the Research Library system for exploring repositories and scientific papers through search, metadata views, paper graphs, and 3D universe visualizations.

Framework Versions

PEFT 0.19.1
