---
tags:
- text-classification
- scientific-abstract
- multi-label
- sentiment-analysis
- distilbert
datasets:
- SciTopicSentimentDataset
license: apache-2.0
---

# SciTopicSentimentClassifier

## 🔬 Overview

SciTopicSentimentClassifier is a **multi-label classification** model fine-tuned to predict, in a single pass, both the **primary scientific topic** and the **underlying sentiment** (high-positive or low-negative) of a research paper from its abstract. It is intended for automated paper categorization, literature review triage, and scientific trend analysis.

The model was trained on the SciTopicSentimentDataset (a proprietary dataset similar to the generated Dataset 1), which pairs each abstract with predefined scientific topics and a binarized sentiment label derived from an original continuous sentiment score.
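
To make the label format concrete, here is a minimal sketch of how one training target could be built from a topic and a continuous sentiment score. The 10 + 2 sizes come from the Model Architecture section; the position layout (topics first, then high-positive, then low-negative), the 0.5 cut-off, and the name `encode_target` are assumptions for illustration only, since the dataset's actual binarization rule is not documented here.

```python
NUM_TOPICS = 10      # predefined scientific topics
NUM_SENTIMENTS = 2   # high-positive, low-negative


def encode_target(topic_index, sentiment_score, threshold=0.5):
    """Return a 12-element multi-hot list for one abstract.

    Assumed layout: positions 0-9 are topics, position 10 is
    high-positive, position 11 is low-negative. The threshold is a
    hypothetical cut-off, not the dataset's documented value.
    """
    if not 0 <= topic_index < NUM_TOPICS:
        raise ValueError("topic_index must be in [0, 10)")
    target = [0.0] * (NUM_TOPICS + NUM_SENTIMENTS)
    target[topic_index] = 1.0
    # Binarize the continuous score into one of the two sentiment slots.
    is_high = sentiment_score >= threshold
    target[NUM_TOPICS + (0 if is_high else 1)] = 1.0
    return target


example = encode_target(topic_index=3, sentiment_score=0.82)
```

Each target then carries exactly one active sentiment slot and at least one active topic slot.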

## 🧠 Model Architecture

This model is an adaptation of **DistilBERT**, a smaller, faster, and lighter version of BERT.

* **Base Model:** `distilbert-base-uncased`
* **Modification:** A custom classification head is added on top of the encoder's hidden state for the first (`[CLS]`) token; DistilBERT has no separate pooler layer.
* **Output Layer:** The final dense layer has **12 outputs** (10 for scientific topics + 2 for sentiment classes). A sigmoid is applied to each output independently, which allows multi-label prediction (an abstract can be assigned several topic/sentiment labels at once).
* **Input:** Tokenized abstract text (up to 512 tokens).
* **Task:** Multi-Label Text Classification.
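
The output layer described above can be illustrated without any deep learning library. The following is a minimal sketch, assuming the 12 logits are ordered with the 10 topic outputs first and the 2 sentiment outputs last (an assumption; the real order is defined by the model's `id2label` mapping). The helper name `score_outputs` and the logit values are made up for the example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_outputs(logits, num_topics=10):
    """Apply an independent sigmoid to each logit and split the result.

    Assumes topics occupy the first `num_topics` positions and the two
    sentiment classes occupy the remaining ones.
    """
    if len(logits) != num_topics + 2:
        raise ValueError("expected num_topics + 2 logits")
    probs = [sigmoid(v) for v in logits]
    return probs[:num_topics], probs[num_topics:]


# Hypothetical logits for one abstract.
logits = [2.0, -3.0, 0.0, -1.5, -4.0, 1.0, -2.0, -2.5, -3.5, -0.5, 1.8, -1.8]
topic_probs, sentiment_probs = score_outputs(logits)
```

Because every output has its own sigmoid, the topic scores are not forced to sum to 1 as they would be under a softmax, so two topics can both exceed a 0.5 threshold for the same abstract.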

## 🚀 Intended Use

* **Automated Labeling:** Automatically assign relevant topic tags to new scientific publication abstracts.
* **Research Triage:** Quickly filter papers based on subject matter and the perceived 'success' or 'novelty' indicated by the abstract's sentiment.
* **Scientific Landscape Mapping:** Analyze large corpora of papers to track emerging positive/negative trends in specific research areas.
* **Indexing Systems:** Integrate the model into library or repository indexing services.
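
As one possible shape for the triage use case, the sketch below filters already-scored papers by topic and sentiment. The label names are the ones shown in this card's example output; the `triage` helper, the record layout, and the paper titles and scores are hypothetical placeholders, not real model results.

```python
def triage(papers, topic, sentiment="High-Positive-Sentiment", min_score=0.5):
    """Return titles of papers whose topic and sentiment scores both
    reach `min_score`, ordered by topic score (highest first)."""
    hits = [
        p for p in papers
        if p["scores"].get(topic, 0.0) >= min_score
        and p["scores"].get(sentiment, 0.0) >= min_score
    ]
    hits.sort(key=lambda p: p["scores"][topic], reverse=True)
    return [p["title"] for p in hits]


# Made-up scores standing in for per-label sigmoid outputs.
papers = [
    {"title": "Paper A", "scores": {"Deep Learning/AI": 0.91, "High-Positive-Sentiment": 0.77}},
    {"title": "Paper B", "scores": {"Deep Learning/AI": 0.64, "High-Positive-Sentiment": 0.21}},
    {"title": "Paper C", "scores": {"Deep Learning/AI": 0.12, "High-Positive-Sentiment": 0.88}},
    {"title": "Paper D", "scores": {"Deep Learning/AI": 0.58, "High-Positive-Sentiment": 0.69}},
]
selected = triage(papers, "Deep Learning/AI")
```

Raising `min_score` trades recall for precision, which may suit a first-pass literature filter.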

## ⚠️ Limitations

* **Topic Granularity:** The model is limited to the 10 predefined topics in its training set. It may perform poorly on highly niche or interdisciplinary topics outside this scope.
* **Sentiment Scope:** The sentiment is coarse-grained (high vs. low) and based on a metric derived from the abstract's language (for example, words such as "novel," "significant," "limitations," and "challenges"). It does not capture nuanced, human-level emotional sentiment.
* **Language:** Trained exclusively on English abstracts.
* **Max Length:** Input texts longer than 512 tokens are truncated, so any content past that point is ignored.
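
The effect of the length limit can be sketched on plain token id lists. This is an illustration of right-side truncation under the assumption that two of the 512 positions are taken by the `[CLS]` and `[SEP]` special tokens (ids 101 and 102 in the uncased BERT vocabulary); the helper `truncate_ids` is hypothetical and is not the tokenizer's own implementation.

```python
CLS_ID = 101  # assumed id of [CLS] in the uncased BERT vocabulary
SEP_ID = 102  # assumed id of [SEP] in the uncased BERT vocabulary


def truncate_ids(content_ids, max_length=512):
    """Keep the leading content tokens that fit between [CLS] and [SEP].

    Returns the final id sequence and the number of content tokens
    that were dropped from the end.
    """
    budget = max_length - 2  # room left after the two special tokens
    kept = content_ids[:budget]
    dropped = len(content_ids) - len(kept)
    return [CLS_ID] + kept + [SEP_ID], dropped


# A stand-in for a 600-token abstract (ids are arbitrary placeholders).
long_ids = list(range(1000, 1600))
model_ids, dropped = truncate_ids(long_ids)
```

Under these assumptions only the first 510 content tokens reach the model, so conclusions stated at the very end of a long abstract would not influence the prediction.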

## 💻 Example Code

To use the model for prediction:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "your-username/SciTopicSentimentClassifier"  # Replace with actual HuggingFace path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample abstract
abstract = "We propose a novel architecture combining convolutional and recurrent neural networks for multi-modal data fusion, demonstrating significant performance gains in complex classification tasks, overcoming prior limitations."

# Preprocess the input (text beyond the model's maximum length is truncated)
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid to get an independent score per label
probs = torch.sigmoid(logits)

# Get predicted labels (e.g., probability > 0.5)
labels = model.config.id2label
predictions = []
for i, prob in enumerate(probs[0]):
    if prob > 0.5:
        predictions.append(labels[i])

print(f"Abstract: {abstract[:80]}...")
print(f"Predicted Labels: {predictions}")
# Expected Output: ['Deep Learning/AI', 'High-Positive-Sentiment']
```