+---
+license: mit
+language:
+- en
+pipeline_tag: text-classification
+library_name: scikit-learn
+tags:
+  - password-strength
+  - cybersecurity
+  - random-forest
+  - scikit-learn
+  - password-classification
+  - password-security
+  - sklearn
+---
+# PasswordHealthModel
+**Model Type**: Random Forest Classifier
+**Framework**: scikit-learn
+**Task**: Password Strength Classification (Weak / Medium / Strong)
+## Overview
+PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:
+- **Weak (0)**
+- **Medium (1)**
+- **Strong (2)**
+The model is a Random Forest classifier trained on 300,000 labeled passwords (100,000 per class). It is designed for integration into password management systems, where it provides real-time strength evaluation and guidance.
+## Intended Uses
+- Integration into password managers (e.g., [Password Utility](https://github.com/naail-khokhar/password_utility)) for evaluating password health.
+- Providing real-time feedback on password strength and generating recommendations for stronger passwords.
+- Enforcing password strength policies in security-focused applications.
+## Training Data
+- **Weak**: 100,000 passwords sourced from the [SecLists dataset](https://github.com/danielmiessler/SecLists).
+- **Medium**: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
+- **Strong**: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).
+All passwords were stripped of whitespace prior to feature extraction.
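
The synthetic classes described above could be generated along the following lines. This is a minimal sketch, not the author's generator: the symbol set, the helper names (`make_medium`, `make_strong`, `clean`), the guarantee of at least one symbol in strong passwords, and the reading of "stripped" as removing leading and trailing whitespace are all assumptions.

```python
import random
import string

ALNUM = string.ascii_letters + string.digits
SYMBOLS = "!@#$%^&*"  # assumed symbol set; the model card does not list one


def make_medium(rng: random.Random) -> str:
    """8-12 alphanumeric characters; roughly 20% of outputs contain a symbol."""
    length = rng.randint(8, 12)
    chars = [rng.choice(ALNUM) for _ in range(length)]
    if rng.random() < 0.20:
        # Swap one position for a symbol so the length stays in range.
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def make_strong(rng: random.Random) -> str:
    """12-16 characters drawn from letters, digits, and symbols."""
    length = rng.randint(12, 16)
    chars = [rng.choice(ALNUM + SYMBOLS) for _ in range(length)]
    # Assumption: every "strong" sample contains at least one symbol.
    chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def clean(password: str) -> str:
    """Assumed preprocessing: drop leading and trailing whitespace."""
    return password.strip()
```

The `random` module is adequate for generating training data but is not suitable for generating real passwords; the standard-library `secrets` module is the usual choice for that.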
+## Features (10 Total)
+- **length**: Number of characters.
+- **entropy**: Shannon entropy of characters.
+- **has_upper**: Binary flag indicating presence of uppercase characters.
+- **has_symbol**: Binary flag indicating presence of special characters.
+- **has_leet**: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
+- **repetition**: Binary flag for repeated sequences (the same character three or more times in a row).
+- **digit_ratio**: Ratio of digits to total length.
+- **unique_ratio**: Ratio of unique characters to total length.
+- **bigram_entropy**: Entropy of character pairs (bigrams).
+- **compression_ratio**: Ratio of compressed length to original length using zlib compression.
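
To make the feature list concrete, here is one way the ten features might be computed using only the Python standard library. It is a sketch under stated assumptions rather than the model's actual extraction code: the leet character set is limited to the four examples given, a "symbol" is taken to be any non-alphanumeric character, entropy is measured in bits, and `extract_features` and `shannon_entropy` are hypothetical names.

```python
import math
import re
import zlib
from collections import Counter

LEET_CHARS = set("@3!0")  # assumed: only the four examples listed above
# Assumed definition of repetition: the same character 3+ times in a row.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shannon_entropy(items) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def extract_features(password: str) -> dict:
    """Return the ten named features for a non-empty password."""
    n = len(password)
    if n == 0:
        raise ValueError("password must be non-empty")
    bigrams = [password[i:i + 2] for i in range(n - 1)]
    raw = password.encode("utf-8")
    return {
        "length": n,
        "entropy": shannon_entropy(password),
        "has_upper": int(any(c.isupper() for c in password)),
        "has_symbol": int(any(not c.isalnum() for c in password)),
        "has_leet": int(any(c in LEET_CHARS for c in password)),
        "repetition": int(REPEAT_RE.search(password) is not None),
        "digit_ratio": sum(c.isdigit() for c in password) / n,
        "unique_ratio": len(set(password)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        # zlib adds header bytes, so short inputs give ratios above 1.
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }
```

For example, `extract_features("aaaB1!")` sets `has_upper`, `has_symbol`, `has_leet`, and `repetition` to 1 and gives a `digit_ratio` of 1/6.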
+## Model Architecture
+- **Algorithm**: Random Forest Classifier (scikit-learn)
+- **Hyperparameters**:
+  - `n_estimators`: 200
+  - `max_depth`: 20
+  - `min_samples_split`: 5
+  - `random_state`: 42
+## Performance
+- **Evaluation Setup**: 80/20 train-test split (240,000 training samples, 60,000 test samples)
+- **Accuracy**: ~96.7% (±0.6% standard deviation)
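
The listed hyperparameters and the 80/20 split can be wired together in scikit-learn roughly as follows. This sketch runs on random placeholder data, so its accuracy says nothing about the real model. The stratified split and the split's `random_state` are assumptions, not details taken from the model card.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: 300 random rows with 10 features and balanced labels 0/1/2.
# A real run would use the features extracted from the 300,000 passwords.
rng = np.random.default_rng(42)
X = rng.random((300, 10))
y = np.repeat([0, 1, 2], 100)

# 80/20 split; stratification and the seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameters as listed in the model card.
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    random_state=42,
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy on placeholder data: {accuracy:.3f}")
```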
+## Limitations
+- Feature engineering is heuristic-based and may not fully capture all password patterns across different contexts.
+- Primarily trained on English-like and synthetic passwords.
+- Potential overfitting to synthetic strong password patterns.
+## Ethical Considerations
+The weak-password data comes from publicly available breach data (the SecLists collection) and was handled with care. The model itself does not store user passwords and is intended only for classification.
+## Dependencies
+This project relies on the following open-source libraries and datasets:
+- **[pandas](https://github.com/pandas-dev/pandas)**: Data manipulation and analysis (BSD-3-Clause License).
+- **[scikit-learn](https://github.com/scikit-learn/scikit-learn)**: Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
+- **[joblib](https://github.com/joblib/joblib)**: Model persistence and parallel computation (MIT License).
+- **[SecLists](https://github.com/danielmiessler/SecLists)**: Dataset for weak passwords (MIT License).
+If redistributing this project, please include the respective license texts for these dependencies.
+## Citation
+Khokhar, Naa'il Ahmad. (2025). *PasswordHealthModel: A Random Forest Model for Password Strength Classification*. Hugging Face Model Hub.