---
library_name: transformers
tags:
- Aspect Term Extraction
- transformers
- t5
language:
- tr
metrics:
- micro-f1
base_model:
- Turkish-NLP/t5-efficient-base-turkish
pipeline_tag: text2text-generation
widget:
  - text: "Pilav çok lezzetliydi ama servis yavaştı."
    example_title: "Demo"
    output:
      text: "pilav, servis"

---

# **Sengil/t5-turkish-aspect-term-extractor** 🇹🇷

A Turkish sequence-to-sequence model based on `Turkish-NLP/t5-efficient-base-turkish`, fine-tuned for **Aspect Term Extraction (ATE)** from customer reviews and sentences.

Given a Turkish sentence, the model generates a comma-separated list of **aspect terms** (e.g., *kahve*, *servis*, *fiyatlar*): the entities or features that the sentence is mainly about.

---

## ✨ Example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import re
from collections import Counter

# Load the tokenizer and model
MODEL_ID = "Sengil/t5-turkish-aspect-term-extractor"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

# Small stopword list used to filter out tokens that are not aspect terms
TURKISH_STOPWORDS = {
    "ve", "çok", "ama", "bir", "bu", "daha", "gibi", "ile", "için",
    "de", "da", "ki", "o", "şu", "sen", "biz", "siz", "onlar"
}

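def is_valid_aspect(word):
    """Return True for a single alphabetic token that is not a stopword."""
    word = word.strip().lower()
    return (
        len(word) > 1 and
        word not in TURKISH_STOPWORDS and
        # isalpha() is False for strings with spaces, so multi-word terms are dropped
        word.isalpha()
    )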

def extract_and_rank_aspects(text, max_tokens=64, beams=5):
    """Generate `beams` candidate outputs and rank terms by how often they occur."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(DEVICE)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_tokens,
            num_beams=beams,
            num_return_sequences=beams,  # keep every beam, not just the best one
            early_stopping=True
        )

    all_predictions = [
        tokenizer.decode(output, skip_special_tokens=True)
        for output in outputs
    ]

    all_terms = []
    for pred in all_predictions:
        # Split each prediction on commas, semicolons, and dashes
        candidates = re.split(r"\s*[;,–—\-]\s*", pred)
        all_terms.extend(w.strip().lower() for w in candidates if is_valid_aspect(w))

    # Count occurrences across all beams, most frequent first
    return Counter(all_terms).most_common()


# Inference
text = "Artılar: Göl manzarasıyla harika bir atmosfer, Ipoh'un her zaman sıcak olan havası nedeniyle iyi bir klima olan restoran, iyi ve hızlı hizmet sunan garsonlar, temassız ödeme kabul eden e-cüzdan, ücretsiz otopark ama sıcak güneş altında açık, yemeklerin tadı güzel."
ranked_aspects = extract_and_rank_aspects(text)

print("Sorted Aspect Terms:")
# The score is the number of times a term appears across the returned beams
for term, score in ranked_aspects:
    print(f"{term:<15}  score: {score}")
```

**Output:**

```
Sorted Aspect Terms:
atmosfer         score: 1
servis           score: 1
restoran         score: 1
hizmet           score: 1
```

---

## 📌 Model Details

| Detail               | Value                                        |
| -------------------- | -------------------------------------------- |
| **Model Type**       | `AutoModelForSeq2SeqLM` (T5-style)           |
| **Base Model**       | `Turkish-NLP/t5-efficient-base-turkish`      |
| **Languages**        | `tr` (Turkish)                               |
| **Fine-tuning Task** | Aspect Term Extraction (sequence generation) |
| **Framework**        | 🤗 Transformers                              |
| **License**          | Apache-2.0                                   |
| **Tokenizer**        | SentencePiece (T5-style)                     |

---

## 📊 Dataset & Training

* Total samples: 37,000+ Turkish review sentences
* Input: Raw sentence (e.g., `"Pilav çok lezzetliydi ama servis yavaştı."`)
* Target: Comma-separated aspect terms (e.g., `"pilav, servis"`)
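
Because the target is a plain comma-separated string, a decoded prediction can be turned back into a list with ordinary string handling. The sketch below is one way to do that; `parse_aspect_terms` is a hypothetical helper and not part of the released code, and the lowercasing and de-duplication are assumptions about what a downstream consumer would want.

```python
def parse_aspect_terms(generated: str) -> list[str]:
    """Split a comma-separated model output into a clean list of terms."""
    terms = []
    seen = set()
    for raw in generated.split(","):
        # Note: str.lower() is not locale-aware, so Turkish dotted/dotless
        # capital I is not folded the way Turkish casing rules expect.
        term = raw.strip().lower()
        if term and term not in seen:  # skip empty pieces and repeats
            seen.add(term)
            terms.append(term)
    return terms


print(parse_aspect_terms("pilav, servis"))  # ['pilav', 'servis']
```

Order of first appearance is preserved, which keeps the list aligned with the order in which the model emitted the terms.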

### Training Configuration

| Setting               | Value              |
| --------------------- | ------------------ |
| **Epochs**            | 3                  |
| **Batch size**        | 8                  |
| **Max input length**  | 128 tokens         |
| **Max output length** | 64 tokens          |
| **Optimizer**         | AdamW              |
| **Learning rate**     | 3e-5               |
| **Scheduler**         | Linear             |
| **Precision**         | FP32               |
| **Hardware**          | 1× Tesla T4 / P100 |
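
To get a feel for what these settings imply, the following sketch works out the number of optimizer steps and the learning rate at a given step. It rests on several assumptions that the card does not state: that all 37,000 samples (a lower bound, since the card says "37,000+") are used for training, that there is no gradient accumulation, and that the linear schedule has no warmup unless `warmup_steps` is set. The names are illustrative only.

```python
import math

# Values taken from the table above.
NUM_SAMPLES = 37_000  # assumption: every sample is used for training
EPOCHS = 3
BATCH_SIZE = 8
PEAK_LR = 3e-5

# Assumption: no gradient accumulation, so one batch is one optimizer step.
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int, warmup_steps: int = 0) -> float:
    """Learning rate under a linear schedule (warmup_steps is an assumption)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return PEAK_LR * step / max(1, warmup_steps)
    # Linear decay from the peak down to 0 at the final step
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)  # 4625 13875
print(linear_lr(total_steps))        # 0.0
```

Under these assumptions a full run is 13,875 steps, with the learning rate falling linearly from 3e-5 to zero.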

---

## 🔍 Evaluation

The model was evaluated using exact-match micro-F1 score on a held-out test set.

| Metric          | Score |
| --------------- | ----: |
| **Micro-F1**    | 0.84+ |
| **Exact Match** |  ~78% |
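
The evaluation script itself is not included in this card, so the following is only a sketch of how such scores are commonly computed. It assumes that a predicted term counts as correct only when it matches a gold term exactly as a string, that counts are pooled over all sentences before computing F1 (the "micro" part), and that Exact Match is the share of sentences whose predicted term set equals the gold set. `micro_f1` and `exact_match` are hypothetical names.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-sentence sets of aspect terms."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # predicted terms that are in the gold set
        fp += len(p - g)  # predicted terms that are not in the gold set
        fn += len(g - p)  # gold terms the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Fraction of sentences whose predicted set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hand-written toy data, not taken from the real test set.
gold = [{"pilav", "servis"}, {"kahve"}]
pred = [{"pilav"}, {"kahve", "fiyatlar"}]

print(round(micro_f1(gold, pred), 3))  # 0.667
print(exact_match(gold, pred))         # 0.0
```

In the toy data there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3, while neither sentence is predicted perfectly.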

---

## 💡 Use Cases

* 💬 Opinion mining in Turkish product or service reviews
* 🧾 Aspect-level sentiment analysis preprocessing
* 📊 Feature-based review summarization in NLP pipelines
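
As an illustration of the summarization use case, extracted terms can be used to group reviews by the aspect they mention. This is a minimal sketch: `group_reviews_by_aspect` is a hypothetical helper, and the aspect lists in the sample data are written by hand rather than produced by the model.

```python
from collections import defaultdict


def group_reviews_by_aspect(extracted):
    """Map each aspect term to the reviews that mention it."""
    index = defaultdict(list)
    for review, aspects in extracted:
        for aspect in aspects:
            index[aspect].append(review)
    return dict(index)


# Hand-written (review, aspect terms) pairs for illustration only.
extracted = [
    ("Pilav çok lezzetliydi ama servis yavaştı.", ["pilav", "servis"]),
    ("Servis hızlıydı.", ["servis"]),
]

by_aspect = group_reviews_by_aspect(extracted)
for aspect, reviews in by_aspect.items():
    print(f"{aspect}: {len(reviews)} review(s)")
```

Each group can then be passed to a sentiment classifier or summarizer to produce a per-aspect view of the reviews.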

---

## 📦 Model Card / Citation

```bibtex
@misc{Sengil2025T5AspectTR,
  title   = {Sengil/t5-turkish-aspect-term-extractor: Turkish Aspect Term Extraction with T5},
  author  = {Şengil, Mert},
  year    = {2025},
  url     = {https://huggingface.co/Sengil/t5-turkish-aspect-term-extractor}
}
```

---

To contribute, suggest improvements, or report a problem, open an issue on GitHub or Hugging Face, or contact **[Mert Şengil](https://www.linkedin.com/in/mertsengil/)**.