---
library_name: transformers
tags:
- text-generation
- paraphrase
- gpt2
- causal-lm
- transformers
- pytorch
license: mit
datasets:
- HHousen/ParaSCI
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---
# Model Card for `gpt2-parasciparaphrase`
## 🧠 Model Summary
This model is a fine-tuned version of [GPT-2](https://huggingface.co/gpt2) on the [ParaSCI dataset](https://huggingface.co/datasets/HHousen/ParaSCI) for paraphrase generation. It takes a sentence as input and generates a paraphrased version of that sentence.
---
## 📋 Model Details
- **Base model:** GPT-2 (`gpt2`)
- **Task:** Paraphrase generation (Causal Language Modeling)
- **Language:** English
- **Training data:** [HHousen/ParaSCI](https://huggingface.co/datasets/HHousen/ParaSCI)
- **Training duration:** 1 epoch over ~270k examples
- **Precision:** `fp16` mixed precision
- **Hardware used:** Tesla T4 (Kaggle Notebook GPU)
- **Framework:** 🤗 Transformers, PyTorch
- **Trained by:** [Your Name or HF Username]
- **License:** MIT
---
## 💡 Intended Use
### ✅ Direct Use
- Generate paraphrased versions of input English sentences in a general academic/technical writing context.
### 🚫 Out-of-Scope Use
- Not suitable for paraphrasing code, informal language, or non-English text.
- Not tested for fairness, bias, or ethical use in downstream applications.
---
## 📊 Evaluation
- **Qualitative Evaluation:** Manual checks indicate coherent paraphrased outputs.
- **Automatic Metrics:** Not yet reported; a sketch of how they could be computed follows.
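As a starting point, standard paraphrase metrics such as BLEU could be computed with the 🤗 `evaluate` library. A minimal, hypothetical sketch (the predictions and references below are illustrative placeholders, not reported results):

```python
# Hypothetical example: score model outputs against ParaSCI references with BLEU.
import evaluate

bleu = evaluate.load("bleu")
predictions = ["AI systems can automate many tasks."]                # model outputs (illustrative)
references = [["Artificial intelligence can automate many tasks."]]  # gold paraphrases (illustrative)
print(bleu.compute(predictions=predictions, references=references))
```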
---
## 🛠 Training Details
- **Dataset:** ParaSCI (`sentence1` → `sentence2`)
- **Preprocessing:** Each pair was concatenated into the prompt `paraphrase this sentence: {sentence1}\n{sentence2}` (see the sketch after this list)
- **Tokenizer:** GPT-2 tokenizer with `pad_token = eos_token`
- **Batch size:** 8
- **Epochs:** 1
- **Learning rate:** 5e-5
- **Logging and checkpointing:** Every 500 steps, using Weights & Biases (`wandb`)
- **Max sequence length:** 256 tokens
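A minimal sketch of the preprocessing described above, assuming the ParaSCI `train` split with `sentence1`/`sentence2` columns; the helper name and the `"ACL"` config are illustrative assumptions, not the actual training script:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

def build_examples(batch):
    # Concatenate prompt and target into one causal-LM training string.
    texts = [
        f"paraphrase this sentence: {s1}\n{s2}"
        for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
    ]
    return tokenizer(texts, truncation=True, max_length=256)

dataset = load_dataset("HHousen/ParaSCI", "ACL")  # config name is an assumption
tokenized = dataset.map(build_examples, batched=True,
                        remove_columns=dataset["train"].column_names)
```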
---
## 🏁 How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-parasciparaphrase")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-parasciparaphrase")

# The prompt format must match training: "paraphrase this sentence: {sentence}\n"
input_text = "paraphrase this sentence: AI models can help in automating tasks.\n"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# pad_token_id is set explicitly because GPT-2 has no dedicated pad token
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
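Note that a causal LM echoes the prompt in its output; to keep only the generated paraphrase, decode just the new tokens, e.g. `tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)`.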