naailkhokhar committed on
Commit 85e932c · verified · 1 Parent(s): 12dc1aa

Update README.md

Files changed (1)
  1. README.md +96 -3
README.md CHANGED
@@ -1,3 +1,96 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ pipeline_tag: text-classification
6
+ library_name: scikit-learn
7
+ tags:
8
+ - password-strength
9
+ - cybersecurity
10
+ - random-forest
11
+ - scikit-learn
12
+ - password-classification
13
+ - password-security
14
+ - sklearn
15
+ ---
16
+ # PasswordHealthModel
17
+
18
+ **Model Type**: Random Forest Classifier
19
+ **Framework**: scikit-learn
20
+ **Task**: Password Strength Classification (Weak / Medium / Strong)
21
+
22
+ ## Overview
23
+
24
+ PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:
25
+
26
+ - **Weak (0)**
27
+ - **Medium (1)**
28
+ - **Strong (2)**
29
+
30
+ The model is a Random Forest classifier trained on 300,000 labeled passwords (100,000 per class). It is designed for integration into password management systems, where it provides real-time strength evaluation and guidance.
31
+
32
+ ## Intended Uses
33
+
34
+ - Integration into password managers (e.g., [Password Utility](https://github.com/naail-khokhar/password_utility)) for evaluating password health.
35
+ - Providing real-time feedback on password strength and generating recommendations for stronger passwords.
36
+ - Enforcing password strength policies in security-focused applications.
37
+
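As a rough illustration of the uses listed above, the sketch below maps a predicted class index to user-facing feedback and checks it against a minimum-strength policy. The class indices come from the Overview; the `ADVICE` messages and the `describe` and `meets_policy` helpers are hypothetical names introduced here, not part of the released model.

```python
# Class indices as listed in the Overview section.
LABELS = {0: "Weak", 1: "Medium", 2: "Strong"}

# Hypothetical feedback text; the model card does not prescribe any wording.
ADVICE = {
    "Weak": "choose a longer password that mixes character types.",
    "Medium": "add length or symbols to reach the strong tier.",
    "Strong": "no changes needed.",
}


def describe(prediction):
    """Turn a predicted class index (0, 1, or 2) into a feedback string."""
    label = LABELS[int(prediction)]
    return f"{label}: {ADVICE[label]}"


def meets_policy(prediction, minimum=2):
    """Return True when the predicted class is at least `minimum`."""
    return int(prediction) >= minimum
```

For example, `describe(1)` yields a message beginning with `Medium:`, and `meets_policy(1)` returns `False` under the default policy, which requires the strong class.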
38
+ ## Training Data
39
+
40
+ - **Weak**: 100,000 passwords sourced from the [SecLists dataset](https://github.com/danielmiessler/SecLists).
41
+ - **Medium**: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
42
+ - **Strong**: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).
43
+
44
+ All passwords were stripped of whitespace prior to feature extraction.
45
+
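The card does not publish the generator used for the synthetic classes, so the following is only a minimal sketch of one way to produce passwords matching the stated constraints. The symbol set (`string.punctuation`), the choice to place exactly one symbol in 20% of medium passwords, and the helper names `make_medium` and `make_strong` are all assumptions.

```python
import random
import string

ALPHANUMERIC = string.ascii_letters + string.digits
SYMBOLS = string.punctuation  # assumption: the card does not list the symbol set


def make_medium(rng):
    """8 to 12 alphanumeric characters; about 20% of outputs get one symbol."""
    length = rng.randint(8, 12)
    chars = [rng.choice(ALPHANUMERIC) for _ in range(length)]
    if rng.random() < 0.20:
        # Overwrite one position so the length stays within 8 to 12.
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def make_strong(rng):
    """12 to 16 characters sampled from letters, digits, and symbols.

    Uniform sampling does not guarantee that every output contains a symbol.
    """
    length = rng.randint(12, 16)
    return "".join(rng.choice(ALPHANUMERIC + SYMBOLS) for _ in range(length))


rng = random.Random(42)  # seeded only to make the sketch reproducible
medium_samples = [make_medium(rng) for _ in range(1000)]
strong_samples = [make_strong(rng) for _ in range(1000)]
```

The `random` module is adequate for building a training set but is not suitable for generating real credentials; the standard library's `secrets` module is the usual choice for that.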
46
+ ## Features (10 Total)
47
+
48
+ - **length**: Number of characters.
49
+ - **entropy**: Shannon entropy of the password's character distribution.
50
+ - **has_upper**: Binary flag indicating presence of uppercase characters.
51
+ - **has_symbol**: Binary flag indicating presence of special characters.
52
+ - **has_leet**: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
53
+ - **repetition**: Binary flag for repeated sequences (≥3 consecutive repeated characters).
54
+ - **digit_ratio**: Ratio of digits to total length.
55
+ - **unique_ratio**: Ratio of unique characters to total length.
56
+ - **bigram_entropy**: Entropy of character pairs (bigrams).
57
+ - **compression_ratio**: Ratio of compressed length to original length using zlib compression.
58
+
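The ten features above could be computed roughly as follows. This is a sketch under stated assumptions rather than the author's extraction code: the leet-speak character set is limited to the four examples given, "special characters" is taken to mean `string.punctuation`, "stripped of whitespace" is read as removing leading and trailing whitespace, and `repetition` is read as three or more identical characters in a row. The names `shannon_entropy` and `extract_features` are hypothetical.

```python
import math
import re
import string
import zlib
from collections import Counter

# Assumption: only the example characters from the card count as leet-speak.
LEET_CHARS = set("@3!0")


def shannon_entropy(items):
    """Shannon entropy, in bits per item, of a sequence of hashable items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(password):
    """Return the ten named features as a dict."""
    pw = password.strip()  # assumption: leading/trailing whitespace only
    n = len(pw)
    if n == 0:
        raise ValueError("password is empty after stripping whitespace")
    raw = pw.encode("utf-8")
    bigrams = [pw[i:i + 2] for i in range(n - 1)]
    return {
        "length": n,
        "entropy": shannon_entropy(pw),
        "has_upper": int(any(ch.isupper() for ch in pw)),
        "has_symbol": int(any(ch in string.punctuation for ch in pw)),
        "has_leet": int(any(ch in LEET_CHARS for ch in pw)),
        # Assumption: the same character three or more times in a row.
        "repetition": int(re.search(r"(.)\1\1", pw) is not None),
        "digit_ratio": sum(ch.isdigit() for ch in pw) / n,
        "unique_ratio": len(set(pw)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        # zlib adds fixed overhead, so short inputs often give a ratio above 1.
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }
```

A model trained on these features would expect them in a fixed column order at inference time; the order shown here simply follows the list above.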
59
+ ## Model Architecture
60
+
61
+ - **Algorithm**: Random Forest Classifier (scikit-learn)
62
+ - **Hyperparameters**:
63
+   - `n_estimators`: 200
63
+   - `max_depth`: 20
64
+   - `min_samples_split`: 5
65
+   - `random_state`: 42
67
+
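Assuming the standard `sklearn.ensemble.RandomForestClassifier` API, the hyperparameters above could be expressed as shown below. The `HYPERPARAMS` dict and the `build_model` helper are illustrative names; any setting not listed in the card is left at scikit-learn's default, which may differ from what the author used.

```python
# Hyperparameters exactly as listed in the model card.
HYPERPARAMS = {
    "n_estimators": 200,
    "max_depth": 20,
    "min_samples_split": 5,
    "random_state": 42,
}


def build_model():
    """Construct an untrained classifier with the listed settings."""
    # Imported lazily so the config above can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(**HYPERPARAMS)
```

Calling `build_model().fit(X_train, y_train)` would then train the forest, where `X_train` holds the ten features per password and `y_train` the class indices 0, 1, and 2.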
68
+ ## Performance
69
+
70
+ - **Evaluation Setup**: 80/20 train-test split (240,000 training samples, 60,000 test samples)
71
+ - **Accuracy**: ~96.7% (±0.6% standard deviation)
72
+
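The split sizes follow directly from the 300,000-password dataset: 20% held out gives 60,000 test samples and leaves 240,000 for training. The sketch below reproduces that arithmetic with a shuffled index split using only the standard library. The card does not say which tool was used or whether the split was stratified by class, so `split_indices` and its seed are assumptions.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(300_000)
print(len(train_idx), len(test_idx))  # 240000 60000
```

In a scikit-learn workflow, `sklearn.model_selection.train_test_split` with `test_size=0.2` serves the same purpose, and its `stratify` argument keeps the three classes balanced across both parts.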
73
+ ## Limitations
74
+
75
+ - The hand-crafted features are heuristic and may miss password patterns that matter in other contexts.
75
+ - The model was trained primarily on English-like and synthetic passwords, so it may generalize poorly to passwords based on other languages.
76
+ - The model may overfit to the patterns of the synthetically generated strong passwords.
78
+
79
+ ## Ethical Considerations
80
+
81
+ Weak-password data comes from publicly available breach lists and was handled with care. The model does not store user passwords and is intended only for classification.
82
+
83
+ ## Dependencies
84
+
85
+ This project relies on the following open-source libraries and datasets:
86
+
87
+ - **[pandas](https://github.com/pandas-dev/pandas)**: Data manipulation and analysis (BSD-3-Clause License).
88
+ - **[scikit-learn](https://github.com/scikit-learn/scikit-learn)**: Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
89
+ - **[joblib](https://github.com/joblib/joblib)**: Model persistence and parallel computation (MIT License).
90
+ - **[SecLists](https://github.com/danielmiessler/SecLists)**: Dataset for weak passwords (MIT License).
91
+
92
+ If redistributing this project, please include the respective license texts for these dependencies.
93
+
94
+ ## Citation
95
+
96
+ Khokhar, Naa'il Ahmad. (2025). *PasswordHealthModel: A Random Forest Model for Password Strength Classification*. Hugging Face Model Hub.