 ---
+library_name: sklearn
+tags:
+- text-classification
+- cyberbullying
+- nlp
+- social-impact
+- ensemble-learning
+dataset_info:
+  name: Cyberbullying Classification
+  source: Kaggle (andrewmvd/cyberbullying-classification)
+metrics:
+- accuracy
+model_file: voting_classifier_model.pkl
 ---
+# Cyberbullying Classification Model (Scikit-Learn)
+This is a classical machine-learning model that classifies tweets into categories of cyberbullying. It is an ensemble **Voting Classifier** combining **Logistic Regression** and **Random Forest**, and it reaches approximately **91% accuracy** on the test set.
+## Model Details
+- **Developed by:** Kohil Sharma
+- **Model Type:** Voting Classifier (Logistic Regression + Random Forest)
+- **Feature Extraction:** TF-IDF (Term Frequency-Inverse Document Frequency)
+- **Library:** Scikit-Learn
+- **Language:** English
+## Intended Use
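+This model is designed to detect specific types of cyberbullying in English tweets. Because it relies on TF-IDF features and Scikit-Learn estimators rather than a neural network, it is lightweight and typically faster at inference than transformer models, making it suitable for low-resource environments.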
+### Classification Labels
+The model classifies text into **5 categories** (mapped as follows):
+- `0`: **Not Cyberbullying**
+- `1`: **Gender** (Sexist)
+- `2`: **Religion**
+- `3`: **Age**
+- `4`: **Ethnicity** (Racist)
+*(Note: The 'Other' category was removed during preprocessing to improve accuracy.)*
+## Training Data
+- **Dataset:** [Cyberbullying Classification Tweets](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification)
+- **Original Size:** ~47,000 tweets
+- **Processed Size:** ~38,000 tweets (after removing duplicates and the 'Other' class)
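The filtering described above can be sketched with the standard library alone. This is a minimal illustration, not the author's script: the column names `tweet_text` and `cyberbullying_type` and the label strings are assumptions about the Kaggle CSV and should be checked against the actual file, and `load_examples` is a hypothetical helper.

```python
import csv
import io

# Assumed column names and label strings for the Kaggle CSV; verify before use.
TEXT_COL = "tweet_text"
LABEL_COL = "cyberbullying_type"

# Mapping taken from the Classification Labels list; any label not listed here
# (for example the 'Other' class) is dropped.
LABEL_TO_ID = {
    "not_cyberbullying": 0,
    "gender": 1,
    "religion": 2,
    "age": 3,
    "ethnicity": 4,
}

def load_examples(csv_file):
    """Return (text, label_id) pairs, skipping unmapped classes and duplicate texts."""
    seen = set()
    examples = []
    for row in csv.DictReader(csv_file):
        text, label = row[TEXT_COL], row[LABEL_COL]
        if label not in LABEL_TO_ID:  # removes the 'Other' class
            continue
        if text in seen:              # removes exact duplicate tweets
            continue
        seen.add(text)
        examples.append((text, LABEL_TO_ID[label]))
    return examples

# Tiny in-memory stand-in for the real CSV file.
sample = io.StringIO(
    "tweet_text,cyberbullying_type\n"
    "hello there,not_cyberbullying\n"
    "hello there,not_cyberbullying\n"
    "some vague insult,other_cyberbullying\n"
    "an ageist remark,age\n"
)
examples = load_examples(sample)
print(examples)
```

With the real dataset, the `io.StringIO` stand-in would be replaced by an opened CSV file.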
+## Training Procedure
+### 1. Preprocessing
+Tweets were cleaned with the `tweet-preprocessor` library and custom functions:
+- Removal of Usernames (@), Hashtags (#), and Links (http).
+- Removal of punctuation and special characters.
+- Conversion to lowercase.
+- **Lemmatization** using NLTK's `WordNetLemmatizer`.
+- Stopword removal (including Twitter-specific stopwords like "rt", "mkr").
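The steps above can be approximated in a few lines. The sketch below uses regular expressions as a stand-in for `tweet-preprocessor`, a deliberately tiny stopword set, and an identity function in place of the lemmatizer, so it illustrates the order of operations rather than reproducing the training pipeline. `clean_tweet`, `STOPWORDS`, and the regex patterns are hypothetical names chosen for this example.

```python
import re
import string

# Illustrative stopword set only; the list used in training is assumed to be
# much longer (general English stopwords plus Twitter-specific ones).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "you", "rt", "mkr"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text, lemmatize=lambda token: token):
    """Strip links, mentions, hashtags and punctuation, lowercase,
    lemmatize each token, then drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    tokens = [lemmatize(tok) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

cleaned = clean_tweet("RT @user: You are SO dumb!!! #mkr http://t.co/abc")
print(cleaned)
```

To bring this closer to the described pipeline, pass `WordNetLemmatizer().lemmatize` from NLTK as the `lemmatize` argument and swap in the full stopword list.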
+### 2. Feature Engineering
+- **TF-IDF Vectorizer** was used to convert text into numerical vectors.
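To show what the vectorizer computes, here is a small pure-Python sketch of TF-IDF weighting. It follows the smoothed formula that Scikit-Learn uses by default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, but it splits on whitespace instead of using Scikit-Learn's tokenizer. Whether the model was trained with default vectorizer settings is not stated, so treat this as an illustration of the technique; `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["dumb school bully", "school day", "nice day"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents receive larger weights: in the first toy document, "dumb" outweighs "school" because "school" also appears in the second document.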
+### 3. Model Architecture
+- **Base Models:**
+    1. `LogisticRegression` (C=100, penalty='l2')
+    2. `RandomForestClassifier` (n_estimators=100)
+- **Ensemble:** `VotingClassifier` (Hard Voting) combining the above two.
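One consequence of hard voting with exactly two base models is worth spelling out: whenever the two models disagree, the vote is tied. Scikit-Learn documents that `VotingClassifier` resolves ties by ascending order of the class labels, so the lower-numbered class wins. The pure-Python sketch below mimics that rule for illustration; `hard_vote` is a hypothetical helper, not part of Scikit-Learn.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over class labels; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)

CLASSES = {0: "Not Cyberbullying", 1: "Gender", 2: "Religion", 3: "Age", 4: "Ethnicity"}

# Each list holds the (Logistic Regression, Random Forest) predictions for one tweet.
agree = hard_vote([2, 2])
tie_with_zero = hard_vote([2, 0])
tie_between_classes = hard_vote([4, 1])

print(CLASSES[agree])
print(CLASSES[tie_with_zero])
print(CLASSES[tie_between_classes])
```

Under this rule, any disagreement that involves label `0` resolves to "Not Cyberbullying", which may be relevant when interpreting missed detections.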
+## Evaluation Results
+- **Accuracy:** ~91% on the test set.
+- **Strengths:** High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
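For readers who want to reproduce these metrics on their own split, accuracy and per-class precision can be computed as sketched below. The labels in the example are made-up toy values, not outputs of this model, and `accuracy` and `per_class_precision` are hypothetical helper names.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_precision(y_true, y_pred):
    """For each predicted class: correct predictions / all predictions of that class."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in predicted.items()}

# Toy labels for illustration only (0-4 follow the Classification Labels mapping).
y_true = [0, 1, 2, 3, 4, 0, 1, 2]
y_pred = [0, 0, 2, 3, 4, 0, 1, 1]

acc = accuracy(y_true, y_pred)
precision = per_class_precision(y_true, y_pred)
print(f"accuracy: {acc:.2f}")
for label in sorted(precision):
    print(f"class {label} precision: {precision[label]:.2f}")
```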
+## How to Use
+To use this model in Python, load both the fitted TF-IDF vectorizer and the model with `joblib`, then clean the input text the same way as in training before vectorizing it.
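+```python
+import joblib
+import preprocessor as p  # pip install tweet-preprocessor
+import string
+
+# 1. Load the saved files.
+# Note: the card metadata lists the model file as `voting_classifier_model.pkl`;
+# use whichever file names are actually present in the repository.
+model = joblib.load('model.pickle')
+vectorizer = joblib.load('tfidf.pickle')
+
+# 2. Define the cleaning function (must match training!).
+# Training also lemmatized tokens and removed stopwords (see Preprocessing);
+# apply the same steps here if the saved vectorizer was fitted on such text.
+def clean_text(text):
+    text = p.clean(text)  # strips URLs, mentions, hashtags, etc.
+    text = text.lower()
+    text = "".join(char for char in text if char not in string.punctuation)
+    return text
+
+# 3. Make a prediction
+text = "You are dumb and you should go back to school."
+clean_input = clean_text(text)
+
+# Vectorize the text (transform expects an iterable of documents)
+vectorized_input = vectorizer.transform([clean_input])
+
+# Predict and map the numeric label back to its category name
+prediction = model.predict(vectorized_input)
+classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}
+print(f"Prediction: {classes[int(prediction[0])]}")
+```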