	---
	tags:
	- text-classification
	- scientific-abstract
	- multi-label
	- sentiment-analysis
	- distilbert
	datasets:
	- SciTopicSentimentDataset
	license: apache-2.0
	---

	# SciTopicSentimentClassifier

	## 🔬 Overview

	SciTopicSentimentClassifier is a multi-label classification model fine-tuned to simultaneously predict the primary scientific topic and the underlying sentiment (high-positive or low-negative) from a research paper's abstract text. This model is ideal for automated paper categorization, literature review triage, and scientific trend analysis.

	The model was trained on the SciTopicSentimentDataset (a proprietary dataset similar to the generated Dataset 1), which links abstract text to predefined scientific topics and a binarized sentiment score derived from the original continuous value.
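As a rough illustration of how such a training target could be assembled, the sketch below builds a 12-dimensional multi-hot vector from a list of topics and a continuous sentiment score. The topic names, the 0.5 cutoff, and the `build_target` helper are hypothetical placeholders; the dataset's actual label set and binarization threshold are not documented here.

```python
# Hypothetical label layout: 10 topic slots followed by 2 sentiment slots.
TOPICS = [f"topic_{i}" for i in range(10)]      # placeholder topic names
SENTIMENTS = ["low-negative", "high-positive"]  # binarized sentiment classes
SENTIMENT_THRESHOLD = 0.5                       # assumed cutoff, not from the dataset


def build_target(topics, sentiment_score):
    """Return a multi-hot vector: one slot per topic, then one per sentiment class."""
    target = [0.0] * (len(TOPICS) + len(SENTIMENTS))
    for topic in topics:
        target[TOPICS.index(topic)] = 1.0
    # Binarize the continuous score into exactly one of the two sentiment slots.
    sentiment_idx = 1 if sentiment_score >= SENTIMENT_THRESHOLD else 0
    target[len(TOPICS) + sentiment_idx] = 1.0
    return target


example = build_target(["topic_3"], sentiment_score=0.82)
```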

	## 🧠 Model Architecture

	This model is an adaptation of DistilBERT, a smaller, faster, and lighter version of BERT.

	* Base Model: `distilbert-base-uncased`
	* Modification: A custom classification head is added on top of DistilBERT's first-token (`[CLS]`) hidden state, which serves as the pooled representation (DistilBERT has no separate pooler layer).
	* Output Layer: The final dense layer has 12 outputs (10 scientific topics + 2 sentiment classes). A sigmoid is applied to each output independently, so an abstract can receive several labels at once.
	* Input: Tokenized abstract text (up to 512 tokens).
	* Task: Multi-Label Text Classification.
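To make the output layout concrete, here is a minimal sketch of decoding a 12-value logit vector with an element-wise sigmoid and a fixed threshold. It assumes the first 10 positions are topics and the last 2 are sentiment classes, which this model card does not confirm; the label names and the `decode` helper are hypothetical. In practice the mapping should be read from `model.config.id2label`.

```python
import math

# Hypothetical label order: 10 topics, then 2 sentiment classes.
LABELS = [f"topic_{i}" for i in range(10)] + ["low-negative", "high-positive"]


def sigmoid(x):
    """Logistic function mapping a logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Return every label whose independent sigmoid score exceeds the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) > threshold]


logits = [-3.0] * 12
logits[2] = 2.0    # one topic fires
logits[11] = 1.5   # the high-positive sentiment output fires
decoded = decode(logits)
```

Because each output is scored independently, nothing prevents both sentiment classes (or neither) from crossing the threshold. Taking the larger of the two sentiment scores is one way to force a single choice.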

	## 🚀 Intended Use

	* Automated Labeling: Automatically assign relevant topic tags to new scientific publication abstracts.
	* Research Triage: Quickly filter papers based on subject matter and the perceived 'success' or 'novelty' indicated by the abstract's sentiment.
	* Scientific Landscape Mapping: Analyze large corpora of papers to track emerging positive/negative trends in specific research areas.
	* Indexing Systems: Integration into library or repository indexing services.
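A triage step along these lines could be sketched as a simple filter over per-paper scores. Everything in the example is illustrative: the paper IDs, the label names, the scores, and the `triage` helper are made up, and the scores are assumed to be sigmoid outputs already computed by the model.

```python
# Hypothetical per-paper scores, e.g. sigmoid outputs keyed by label name.
papers = [
    {"id": "paper-a", "scores": {"topic_2": 0.91, "high-positive": 0.88}},
    {"id": "paper-b", "scores": {"topic_2": 0.95, "high-positive": 0.30}},
    {"id": "paper-c", "scores": {"topic_2": 0.64, "high-positive": 0.72}},
    {"id": "paper-d", "scores": {"topic_5": 0.80, "high-positive": 0.90}},
]


def triage(papers, topic, sentiment, min_score=0.5):
    """Keep papers scoring at least min_score on both labels, best topic match first."""
    kept = [
        p for p in papers
        if p["scores"].get(topic, 0.0) >= min_score
        and p["scores"].get(sentiment, 0.0) >= min_score
    ]
    return sorted(kept, key=lambda p: p["scores"][topic], reverse=True)


shortlist = [p["id"] for p in triage(papers, "topic_2", "high-positive")]
```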

	## ⚠️ Limitations

	* Topic Granularity: The model is limited to the 10 predefined topics in its training set. It may perform poorly on highly niche or interdisciplinary topics outside this scope.
	* Sentiment Scope: The sentiment is coarse-grained (high vs. low) based on a metric derived from the abstract's language (e.g., using words like "novel," "significant," "limitations," "challenges"). It does not capture nuanced human-level emotional sentiment.
	* Language: Trained exclusively on English abstracts.
	* Max Length: Input texts longer than 512 tokens are truncated.
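One possible workaround for the truncation limit, which this model card does not itself describe, is to split a long abstract into overlapping token windows, score each window, and pool the per-label scores. The sketch below shows only the windowing and pooling logic on plain lists; `window_token_ids` and `max_pool_scores` are hypothetical helpers, and the default of 510 assumes two positions are reserved for the `[CLS]` and `[SEP]` special tokens.

```python
def window_token_ids(token_ids, max_len=510, overlap=128):
    """Split token ids into windows of at most max_len tokens that overlap by `overlap`."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the text
    return windows


def max_pool_scores(per_window_scores):
    """Combine per-window label scores by taking the maximum for each label."""
    return [max(column) for column in zip(*per_window_scores)]


windows = window_token_ids(list(range(1000)))
pooled = max_pool_scores([[0.1, 0.9], [0.7, 0.2]])
```

Max pooling favors recall, since a label is kept if any window supports it; averaging the window scores is a more conservative alternative.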

	## 💻 Example Code

	To use the model for prediction:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load the model and tokenizer
	model_name = "your-username/SciTopicSentimentClassifier" # Replace with actual HuggingFace path
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Sample Abstract
	abstract = "We propose a novel architecture combining convolutional and recurrent neural networks for multi-modal data fusion, demonstrating significant performance gains in complex classification tasks, overcoming prior limitations."

	# Preprocess the input
	inputs = tokenizer(abstract, return_tensors="pt", truncation=True, padding=True)

	# Run inference
	with torch.no_grad():
	    logits = model(**inputs).logits

	# Apply sigmoid for multi-label scores
	probs = torch.sigmoid(logits)

	# Get predicted labels (e.g., probability > 0.5)
	labels = model.config.id2label
	predictions = []
	for i, prob in enumerate(probs[0]):
	    if prob > 0.5:
	        predictions.append(labels[i])

	print(f"Abstract: {abstract[:80]}...")
	print(f"Predicted Labels: {predictions}")
	# Example output (label names depend on the checkpoint's id2label mapping): ['Deep Learning/AI', 'High-Positive-Sentiment']