Credit Card Fraud Detection with Random Forest
Project Description
This project detects fraudulent credit card transactions using a supervised machine learning approach. The dataset is highly imbalanced (fraud makes up well under 1% of transactions), which makes it representative of real-world fraud detection. We trained a Random Forest classifier configured for performance and robustness.
Dataset Overview
- Source: Kaggle - Credit Card Fraud Detection
- Description: Transactions made by European cardholders in September 2013.
- Total Samples: 284,807 transactions
- Fraudulent Cases: 492 (~0.172%)
- Features:
  - Time: Time elapsed since the first transaction
  - Amount: Transaction amount
  - V1 to V28: Principal components (PCA-transformed)
  - Class: Target (0 = Legitimate, 1 = Fraudulent)
Model Used
RandomForestClassifier Configuration:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
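This README does not say how the data was split for training, so the following is only a sketch of one common setup: a stratified hold-out split, which keeps the fraud ratio the same in both partitions. The synthetic data from `make_classification`, the 80/20 split, and the reduced `n_estimators=50` (to keep the example fast) are all assumptions made for illustration, not the project's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic, imbalanced stand-in for the real transaction data.
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=10,
    weights=[0.98, 0.02],
    random_state=42,
)

# Assumption: stratified 80/20 split so both partitions keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same settings as the configuration above, except fewer trees for speed.
rfc = RandomForestClassifier(
    n_estimators=50,
    max_depth=20,
    max_features="sqrt",
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
)
rfc.fit(X_train, y_train)

# Probability of the positive (fraud) class for each held-out sample.
fraud_probability = rfc.predict_proba(X_test)[:, 1]
print(f"Positive rate in train: {y_train.mean():.4f}, in test: {y_test.mean():.4f}")
```

With the real dataset, `X` and `y` would come from the feature columns and the `Class` column instead.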
Model Evaluation Metrics
| Metric | Value |
|---|---|
| Accuracy | 0.9996 |
| Precision | 0.9747 |
| Recall (Sensitivity) | 0.7857 |
| F1 Score | 0.8701 |
| Matthews Correlation Coefficient (MCC) | 0.8749 |
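Interpretation:
- High accuracy is expected due to class imbalance: with only ~0.172% fraud, a model that labels every transaction as legitimate would already reach about 99.83% accuracy, so accuracy alone says little.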
- Precision is high: most predicted frauds are true frauds.
- Recall is moderate: some frauds are missed.
- F1 score balances precision and recall.
- MCC gives a reliable measure even with class imbalance.
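To make the relationship between these metrics concrete, here is a small worked example that computes all five from confusion-matrix counts. The counts are hypothetical: the actual evaluation counts are not given in this README, and these are simply one combination that reproduces the rounded values in the table.

```python
import math

# Hypothetical counts (assumption, not published results):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 77, 2, 21, 56_862

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted frauds that are real
recall = tp / (tp + fn)      # share of real frauds that are caught
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1", f1),
    ("MCC", mcc),
]:
    print(f"{name:<10} {value:.4f}")
```

Note how the large true-negative count dominates accuracy, while recall and MCC still reflect the missed frauds.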
Performance Timing
| Phase | Time (seconds) |
|---|---|
| Training | 375.41 |
| Prediction | 0.94 |
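How these timings were measured is not stated. One common way to take wall-clock measurements like these is `time.perf_counter`, sketched below. The `timed` helper is a hypothetical name introduced for illustration, and `sum` stands in for a call such as `rfc.fit` or `model.predict`.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; with a real model this could be timed(rfc.fit, X_train, y_train).
total, seconds = timed(sum, range(1_000_000))
print(f"Elapsed: {seconds:.4f} s")
```

Timings depend heavily on hardware and on `n_jobs`, so expect different numbers on other machines.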
Exported Artifacts
- random_forest_model_fraud_classification.pkl: Trained Random Forest model
- features.json: Feature list used during training
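As a rough sketch of how a feature list like this could be produced, the snippet below builds the column names and round-trips them through JSON. The exact contents of `features.json` are an assumption here (the 30 input columns in the order of the Kaggle CSV), and the file is written to a temporary directory so the sketch cannot overwrite the real artifact.

```python
import json
import tempfile
from pathlib import Path

# Assumption: all 30 input columns, in the order they appear in the Kaggle CSV.
features = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = Path(tmp_dir) / "features.json"

    # Save the feature list next to where the model file would go.
    with open(features_path, "w") as f:
        json.dump(features, f, indent=2)

    # Read it back to confirm the order survives the round trip.
    with open(features_path, "r") as f:
        loaded = json.load(f)

print(len(loaded), loaded[:3], loaded[-1])

# The fitted model itself would typically be saved with joblib, for example:
# joblib.dump(rfc, "random_forest_model_fraud_classification.pkl")
```

Storing the column order alongside the model matters because the classifier expects features in the same order at prediction time as at training time.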
Usage Guide
1. Install Dependencies
pip install pandas scikit-learn joblib
2. Load Model and Features
import joblib
import json
import pandas as pd
# Load the trained model
model = joblib.load("random_forest_model_fraud_classification.pkl")
# Load the feature list
with open("features.json", "r") as f:
    features = json.load(f)
3. Prepare Input Data
# Load your new transaction data
df = pd.read_csv("your_new_transactions.csv")
# Filter to keep only relevant features
df = df[features]
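If the new data is missing one of the expected columns, `df[features]` fails with a `KeyError`. A small check before selecting can give a clearer message. The `check_features` helper below is hypothetical (not part of this project), and the column names in the demo are made up for illustration.

```python
def check_features(columns, features):
    """Return the feature list if every feature is present, else raise ValueError."""
    missing = [name for name in features if name not in columns]
    if missing:
        raise ValueError(f"Input is missing required feature columns: {missing}")
    return list(features)

# Illustrative stand-ins for df.columns and the loaded feature list.
incoming_columns = ["Time", "V1", "V2", "Amount", "merchant_id"]
required = ["Time", "V1", "V2", "Amount"]

selected = check_features(incoming_columns, required)
print(selected)
```

With a DataFrame this could be used as `df = df[check_features(df.columns, features)]`, which also drops extra columns and enforces the training column order.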
4. Make Predictions
# Predict classes
predictions = model.predict(df)
# Predict fraud probability
probabilities = model.predict_proba(df)[:, 1]
print(predictions)
print(probabilities)
Notes
- Due to the high class imbalance, precision and recall should always be monitored.
- Adjust the decision threshold to optimize for recall or precision depending on your business needs.
- The model performed well in the evaluation above, but fraud patterns shift over time, so it should be retrained periodically with new data.
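The threshold adjustment mentioned above can be sketched as follows. Instead of relying on `model.predict`, the fraud probabilities are compared against a cutoff of your choosing. The helper names, labels, and scores are made up for illustration; in practice the scores would be the `probabilities` array from `predict_proba`, and the labels would come from a validation set.

```python
def classify_with_threshold(probabilities, threshold=0.5):
    """Label a transaction as fraud (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for the positive class from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up validation labels and fraud probabilities (assumption).
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.05, 0.40, 0.35, 0.80, 0.10, 0.55, 0.60, 0.95]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = classify_with_threshold(scores, threshold)
    results[threshold] = precision_recall(y_true, y_pred)
    precision, recall = results[threshold]
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold catches more fraud at the cost of more false alarms, and raising it does the opposite. Which side to favor depends on the relative cost of a missed fraud versus a blocked legitimate transaction.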
Acknowledgements
- Dataset provided by ULB & Worldline
- Original research: Dal Pozzolo et al.
- Credit Card Fraud Detection - Kaggle
License
Apache License 2.0. You are free to use, modify, and distribute this project under the terms of the license.