---
library_name: transformers
tags:
- text-generation
- paraphrase
- gpt2
- causal-lm
- transformers
- pytorch
license: mit
datasets:
- HHousen/ParaSCI
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# Model Card for `gpt2-parasciparaphrase`

## 🧠 Model Summary

This model is a fine-tuned version of [GPT-2](https://huggingface.co/gpt2) on the [ParaSCI dataset](https://huggingface.co/datasets/HHousen/ParaSCI) for paraphrase generation. It takes an English sentence as input and generates a paraphrase of it.

---

## 📋 Model Details

- **Base model:** GPT-2 (`gpt2`)
- **Task:** Paraphrase generation (Causal Language Modeling)
- **Language:** English
- **Training data:** [HHousen/ParaSCI](https://huggingface.co/datasets/HHousen/ParaSCI)
- **Training:** 1 epoch on ~270k examples
- **Precision:** `fp16` mixed precision
- **Hardware used:** Tesla T4 (Kaggle Notebook GPU)
- **Framework:** 🤗 Transformers, PyTorch
- **Trained by:** [Your Name or HF Username]
- **License:** MIT

---

## 💡 Intended Use

### ✅ Direct Use

- Generate paraphrased versions of English input sentences in a general academic/technical writing context.

### 🚫 Out-of-Scope Use

- Not suitable for paraphrasing code, informal language, or languages other than English.
- Not tested for fairness, bias, or ethical use in downstream applications.

---

## 📊 Evaluation

- **Qualitative Evaluation:** Manual checks indicate coherent paraphrased outputs.
- **Automatic Metrics:** Not yet reported.

---

## 🛠 Training Details

- **Dataset:** ParaSCI (`sentence1` → `sentence2`)
- **Preprocessing:** Each pair is concatenated into the prompt `paraphrase this sentence: {sentence1}\n{sentence2}` (see the sketch below)
- **Tokenizer:** GPT-2 tokenizer with `pad_token = eos_token`
- **Batch size:** 8
- **Epochs:** 1
- **Learning rate:** 5e-5
- **Logging and checkpointing:** Every 500 steps, using Weights & Biases (`wandb`)
- **Max sequence length:** 256 tokens

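The exact training script is not published with this card, but the setup above can be sketched roughly as follows using the 🤗 `Trainer` API. The hyperparameters mirror the values listed in this section; the dataset configuration handling, function names, and output directory are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# GPT-2 has no padding token, so the EOS token is reused (as noted above).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A dataset configuration name (e.g. an ACL or arXiv subset) may be required here.
dataset = load_dataset("HHousen/ParaSCI", split="train")

def tokenize_pair(example):
    # "paraphrase this sentence: {sentence1}\n{sentence2}", as described above.
    text = f"paraphrase this sentence: {example['sentence1']}\n{example['sentence2']}"
    return tokenizer(text, truncation=True, max_length=256)

tokenized = dataset.map(tokenize_pair, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2-parasciparaphrase",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    fp16=True,
    logging_steps=500,
    save_steps=500,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # Standard causal-LM collator: pads inputs and copies them as labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
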
---

## 🏁 How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-parasciparaphrase")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-parasciparaphrase")

# The model was trained on prompts of the form
# "paraphrase this sentence: {sentence1}\n{sentence2}",
# so the input should end with a newline after the source sentence.
input_text = "paraphrase this sentence: AI models can help in automating tasks.\n"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# pad_token_id is set explicitly because GPT-2 has no dedicated padding token.
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
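
Because the model simply continues the prompt, the decoded text contains the prompt itself followed by the paraphrase (and possibly extra text). A small post-processing step, shown below as an illustrative snippet that continues the example above, keeps only the first generated line:

```python
# Strip the prompt and keep only the first generated line as the paraphrase.
generated = tokenizer.decode(output[0], skip_special_tokens=True)
paraphrase = generated[len(input_text):].split("\n")[0].strip()
print(paraphrase)
```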