Generated from Research Library:

https://github.com/peytontolbert/Research_Library

A3 Abstract Keywords

A3 is a LoRA adapter for google/flan-t5-base, trained within the Research Library project to generate concise keyword lists from scientific paper abstracts.

This repository contains the PEFT adapter only. Load it on top of google/flan-t5-base.

Intended Use

  • Generate paper keywords from abstracts.
  • Tag and index scientific papers for search, browsing, and clustering.
  • Provide lightweight metadata for paper library interfaces.

This model is not intended to summarize full papers or generate scientific claims.
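
As an illustration of the tagging and indexing use case above, the sketch below builds a small inverted index from keywords to papers. The function name `build_keyword_index`, the paper IDs, and the keyword lists are hypothetical and are not part of the Research Library code; the keywords are assumed to be already cleaned.

```python
from collections import defaultdict


def build_keyword_index(paper_keywords):
    """Map each lowercased keyword to the sorted list of paper IDs tagged with it."""
    index = defaultdict(set)
    for paper_id, keywords in paper_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(paper_id)
    # Convert sets to sorted lists so lookups return a stable order.
    return {keyword: sorted(ids) for keyword, ids in index.items()}


# Hypothetical paper IDs with already-cleaned keyword lists.
papers = {
    "paper-001": ["citation", "neural", "network", "abstracts"],
    "paper-002": ["protein", "structure", "diffusion", "transformer"],
    "paper-003": ["graph", "neural", "retrieval"],
}

index = build_keyword_index(papers)
print(index["neural"])  # ['paper-001', 'paper-003']
```

A lookup such as `index["neural"]` then returns every paper tagged with that keyword, which is one simple way a library interface could back keyword search or browsing.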

Training Data

The adapter was trained on PeytonT/1m_papers_text, a dataset of full text and metadata for 1M papers. For A3, the training input was the paper abstract, and the target was a normalized keyword string derived from the available paper metadata and text fields.

Training Procedure

  • Base model: google/flan-t5-base
  • Adapter: LoRA
  • Task type: SEQ_2_SEQ_LM
  • LoRA rank: 8
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules: q, v
  • Objective: cross-entropy sequence generation
  • Batch size: 256
  • Max source tokens: 768
  • Max target tokens: 128
  • Precision: bf16
  • Optimizer: AdamW
  • Learning rate: 1e-4
  • Warmup steps: 1000
  • Training steps: 3895
  • Epochs: 1.0
  • Last logged train loss: 0.3175
  • Hardware: single H100 80GB GPU
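
The hyperparameters above imply a few derived quantities, sketched below as plain arithmetic. The dictionary keys mirror the argument names of `peft.LoraConfig`. The architecture numbers (hidden size 768, 12 encoder and 12 decoder layers) are assumptions taken from the published flan-t5-base configuration rather than from this card, and the example count assumes the listed batch size is the effective batch size per optimizer step.

```python
# Values copied from the list above; key names follow peft.LoraConfig arguments.
lora = {"r": 8, "lora_alpha": 32, "lora_dropout": 0.05, "target_modules": ["q", "v"]}
batch_size = 256
training_steps = 3895

# ASSUMPTION: flan-t5-base architecture, not stated in this card.
d_model = 768        # q and v are 768 -> 768 linear projections
encoder_layers = 12  # one self-attention block per encoder layer
decoder_layers = 12  # self-attention plus cross-attention per decoder layer

# Examples seen during training (assumes 256 is the effective batch size).
examples_seen = training_steps * batch_size

# LoRA scales its low-rank update by alpha / r.
scaling = lora["lora_alpha"] / lora["r"]

# Each adapted projection adds two matrices: A (r x d_in) and B (d_out x r).
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_modules = attention_blocks * len(lora["target_modules"])
params_per_module = lora["r"] * (d_model + d_model)
trainable_params = adapted_modules * params_per_module

print(examples_seen)     # 997120
print(scaling)           # 4.0
print(trainable_params)  # 884736
```

Under these assumptions, 3895 steps at batch size 256 cover about 997K examples, which is consistent with roughly one pass over a 1M-paper dataset, and the adapter adds fewer than 1M trainable parameters.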

Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

repo_id = "PeytonT/1m-papers-abstract-keywords"
base_id = "google/flan-t5-base"

# Load the tokenizer and base model, then attach the LoRA adapter on top.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, repo_id)
model.eval()

abstract = (
    "We introduce a graph neural network retrieval system for scientific papers "
    "that jointly embeds citation links, paper abstracts, and author metadata."
)
# Truncate to the 768-token source limit used during training.
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=768)
# T5 does not use token_type_ids; drop them if the tokenizer returned any.
inputs.pop("token_type_ids", None)

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

keywords = tokenizer.decode(output[0], skip_special_tokens=True)
print(keywords)

Example Outputs

Raw generations may contain duplicate terms. A simple downstream post-process should split on commas, trim whitespace, remove stopwords, and de-duplicate while preserving order.

Input: graph neural retrieval system for scientific papers...
Output: citation, neural, network, abstracts

Input: diffusion transformer for protein structure generation...
Output: protein, structure, diffusion, transformer, equivariant, evaluations
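
The post-processing described at the start of this section can be sketched as follows. This is a minimal illustration, not code from the Research Library project: the function name `clean_keywords`, the short stopword list, and the raw input string are all hypothetical.

```python
# ASSUMPTION: a tiny illustrative stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {
    "a", "an", "and", "for", "from", "in", "of", "on", "or", "that", "the", "to", "with",
}


def clean_keywords(raw, stopwords=STOPWORDS):
    """Split on commas, trim, drop stopwords, and de-duplicate while preserving order."""
    seen = set()
    cleaned = []
    for part in raw.split(","):
        term = part.strip()
        key = term.lower()  # compare case-insensitively
        if not key or key in stopwords or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)  # keep the first spelling encountered
    return cleaned


# Hypothetical raw generation with a repeated term and two stopwords.
raw = "protein, structure, with, protein , diffusion, that, transformer"
print(clean_keywords(raw))  # ['protein', 'structure', 'diffusion', 'transformer']
```

Tracking seen terms in a set while appending to a list keeps the first occurrence of each keyword in its original position, so the model's ordering is preserved.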

Limitations

  • Without post-processing, the model can repeat keywords and may emit generic words such as "that", "with", or "from".
  • It is optimized for abstract-level keyword generation, not full-paper tagging.
  • External benchmark evaluation has not yet been run.
  • Keyword quality depends on the quality and style of the source abstracts and metadata.

Project Context

This model is part of the Research Library system for exploring repositories and scientific papers through search, metadata views, paper graphs, and 3D universe visualizations.

Framework Versions

  • PEFT 0.19.1