Commit b7ac775 by kohils · verified · 1 Parent(s): 29f4cd2

Update README.md

  1. README.md +99 -1
README.md CHANGED
@@ -1,3 +1,101 @@
  ---
- license: mit
+ library_name: sklearn
+ tags:
+ - text-classification
+ - cyberbullying
+ - nlp
+ - social-impact
+ - ensemble-learning
+ dataset_info:
+   name: Cyberbullying Classification
+   source: Kaggle (andrewmvd/cyberbullying-classification)
+ metrics:
+ - accuracy
+ model_file: voting_classifier_model.pkl
  ---
+
+ # Cyberbullying Classification Model (Scikit-Learn)
+
+ This is a classical machine learning model that classifies tweets into categories of cyberbullying. It is an ensemble **Voting Classifier** that combines **Logistic Regression** and **Random Forest** and reaches approximately **91% accuracy** on the test set.
+
+ ## Model Details
+
+ - **Developed by:** Kohil Sharma
+ - **Model Type:** Voting Classifier (Logistic Regression + Random Forest)
+ - **Feature Extraction:** TF-IDF (Term Frequency-Inverse Document Frequency)
+ - **Library:** Scikit-Learn
+ - **Language:** English
+
+ ## Intended Use
+ This model is designed to detect specific types of cyberbullying in text. It is lightweight and typically faster at inference than transformer models, making it suitable for low-resource environments.
+
+ ### Classification Labels
+ The model classifies text into **5 categories**, encoded as the following integer labels:
+ - `0`: **Not Cyberbullying**
+ - `1`: **Gender** (Sexist)
+ - `2`: **Religion**
+ - `3`: **Age**
+ - `4`: **Ethnicity** (Racist)
+
+ *(Note: The 'Other' category was removed during preprocessing to improve accuracy.)*
+
+ ## Training Data
+ - **Dataset:** [Cyberbullying Classification Tweets](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification)
+ - **Original Size:** ~47,000 tweets
+ - **Processed Size:** ~38,000 tweets (after removing duplicates and the 'Other' class)
+
+ ## Training Procedure
+
+ ### 1. Preprocessing
+ The text was cleaned using the `tweet-preprocessor` library and custom functions:
+ - Removal of usernames (@), hashtags (#), and links (http).
+ - Removal of punctuation and special characters.
+ - Conversion to lowercase.
+ - **Lemmatization** using NLTK's `WordNetLemmatizer`.
+ - Stopword removal (including Twitter-specific stopwords such as "rt" and "mkr").
+
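The removal steps listed above can be approximated with the standard library alone. The sketch below is not the author's cleaning code: `approximate_clean` and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, regular expressions stand in for `tweet-preprocessor`, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the list actually used in training is not given in the README.
STOPWORDS = {"a", "an", "and", "are", "is", "the", "to", "you", "rt", "mkr"}


def approximate_clean(text: str) -> str:
    """Regex-based stand-in for the removal steps (no lemmatization)."""
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[@#]\w+", " ", text)       # remove usernames and hashtags
    text = text.lower()
    # Remove ASCII punctuation and special characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stopwords, including Twitter-specific ones such as "rt"
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(approximate_clean("RT @user: You are SO dumb!! #mkr http://t.co/abc"))
```

Links, usernames, and hashtags are removed before punctuation is stripped, because the `@`, `#`, and `://` characters are what the patterns match on.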
+ ### 2. Feature Engineering
+ - A **TF-IDF Vectorizer** was used to convert the cleaned text into numerical feature vectors.
+
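As a rough illustration of this step, the sketch below fits scikit-learn's `TfidfVectorizer` on a toy corpus. The corpus is made up, and the default settings are an assumption, since the README does not state which vectorizer parameters were used in training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, made-up corpus standing in for the cleaned tweets.
corpus = [
    "you are dumb",
    "go back to school",
    "school is fun",
]

# Default settings (an assumption; the training parameters are not documented).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix with one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The fitted vectorizer has to be saved along with the classifier, because new text must be transformed with the same vocabulary and IDF weights at prediction time.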
+ ### 3. Model Architecture
+ - **Base Models:**
+   1. `LogisticRegression` (C=100, penalty='l2')
+   2. `RandomForestClassifier` (n_estimators=100)
+ - **Ensemble:** `VotingClassifier` (Hard Voting) combining the above two.
+
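A minimal sketch of how such an ensemble might be assembled in scikit-learn follows. The hyperparameters `C=100` and `n_estimators=100` come from the list above; the toy texts, the estimator names `"lr"` and `"rf"`, `max_iter=1000`, and `random_state=0` are assumptions added so the example runs on its own.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, made-up examples using two of the label ids (0 and 3) for illustration.
texts = [
    "have a nice day", "see you at lunch", "great game today",
    "dumb school kid", "stupid little kid", "go back to school kid",
]
labels = [0, 0, 0, 3, 3, 3]

X = TfidfVectorizer().fit_transform(texts)

ensemble = VotingClassifier(
    estimators=[
        # L2 is scikit-learn's default penalty; max_iter is an assumption.
        ("lr", LogisticRegression(C=100, max_iter=1000)),
        # random_state is an assumption, added for reproducibility.
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",  # majority vote over the predicted class labels
)
ensemble.fit(X, labels)

preds = ensemble.predict(X)
print(preds)
```

With only two base models, hard voting ties whenever they disagree. The scikit-learn documentation states that ties are resolved by ascending sort order of the class labels, so the lower label id wins in that case.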
+ ## Evaluation Results
+ - **Accuracy:** ~91% on the test set.
+ - **Strengths:** High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
+
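For reference, accuracy and per-class precision could be computed with scikit-learn as sketched below. The label arrays are made up to show the calls and are not this model's predictions or results.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up label ids for illustration; these are not the model's results.
y_true = [0, 1, 2, 3, 4, 4, 3, 2]
y_pred = [0, 1, 2, 3, 4, 0, 3, 2]

accuracy = accuracy_score(y_true, y_pred)
# average=None returns one precision value per class, in sorted label order.
per_class_precision = precision_score(y_true, y_pred, average=None)

print(f"accuracy = {accuracy:.3f}")
for label, value in enumerate(per_class_precision):
    print(f"class {label}: precision = {value:.2f}")
```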
+ ## How to Use
+
+ To use this model in Python, load both the TF-IDF vectorizer and the model with `joblib`, then clean the input text the same way as in training before vectorizing it.
+
+ ```python
+ import string
+
+ import joblib
+ import preprocessor as p  # pip install tweet-preprocessor
+
+ # 1. Load the saved files.
+ # File names are as given here; the model card metadata lists
+ # voting_classifier_model.pkl, so adjust them to the files in the repo.
+ model = joblib.load('model.pickle')
+ vectorizer = joblib.load('tfidf.pickle')
+
+ # 2. Define the cleaning function (must match training!).
+ # Training also applied lemmatization and stopword removal, not shown here.
+ def clean_text(text):
+     text = p.clean(text)  # strips URLs, mentions, hashtags, etc.
+     text = text.lower()
+     # Drop punctuation characters
+     text = "".join(char for char in text if char not in string.punctuation)
+     return text
+
+ # 3. Make a prediction
+ text = "You are dumb and you should go back to school."
+ clean_input = clean_text(text)
+
+ # Vectorize the text (transform expects a list of documents)
+ vectorized_input = vectorizer.transform([clean_input])
+
+ # Predict and map the integer label to its class name
+ prediction = model.predict(vectorized_input)
+ classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}
+
+ print(f"Prediction: {classes[int(prediction[0])]}")
+ ```