Update README.md
README.md
CHANGED
@@ -1,3 +1,255 @@
---
license: apache-2.0
tags:
- sklearn
- ensemble
- heart-disease
- cardiovascular-risk
- imbalanced-data
- explainable-ai
metrics:
- roc_auc
- accuracy
- recall
- precision
- f1
language:
- en
---

# Model Card for Heart Risk AI - Cardiovascular Risk Prediction v3

This ensemble model predicts the risk of cardiovascular disease. It is tuned for high sensitivity so that few cases are missed in medical screening scenarios.

## Model Details

### Model Description

The model combines four machine learning algorithms (Random Forest, XGBoost, Gradient Boosting, and Logistic Regression) in an ensemble to predict cardiovascular disease risk. It is optimized for high sensitivity (recall), which suits initial screening, where detecting as many potential cases as possible takes priority over minimizing false positives.

- **Developed by:** Juan Manuel Infante Quiroga
- **Model type:** Ensemble Classification (Random Forest + XGBoost + Gradient Boosting + Logistic Regression)
- **Language(s):** Python (scikit-learn)
- **License:** Apache 2.0
- **Version:** v3 (optimized for high sensitivity)

### Model Sources

- **Repository:** https://github.com/JuanInfante122/heart-risk-ai.git
- **Dataset:** UCI Heart Disease Dataset (Cleveland Clinic Foundation)
- **Framework:** scikit-learn, XGBoost

## Uses

### Direct Use

This model is intended for **educational and research purposes only**. Within that scope, it can be used as a screening tool for:

- Initial cardiovascular risk assessment in populations over 40 years of age
- Triage support in emergency services
- Risk factor monitoring for follow-up patients
- Patient education and awareness programs

**⚠️ IMPORTANT: This tool does NOT replace complete clinical evaluation, specific diagnostic tests, or professional medical judgment.**

### Downstream Use

The model can be integrated into:

- Healthcare screening applications
- Clinical decision support systems (as a supplementary tool)
- Research studies on cardiovascular risk prediction
- Educational platforms for medical training

### Out-of-Scope Use

- **Definitive diagnosis:** Never use for final diagnostic decisions
- **Treatment decisions:** Not suitable for determining treatment protocols
- **Standalone medical advice:** Requires professional medical interpretation
- **Real-time emergency decisions:** Not validated for acute care settings

## Bias, Risks, and Limitations

### Known Limitations

1. **Population Bias:** Training data come primarily from a Caucasian population, which limits applicability to other ethnic groups
2. **Missing Biomarkers:** Does not include advanced biomarkers (troponins, BNP, etc.)
3. **Medication Status:** Does not consider the patient's current medication regimen
4. **Temporal Validation:** Requires validation in prospective cohorts
5. **High False Positive Rate:** 45.3% of healthy patients may be incorrectly flagged as at-risk

### Technical Risks

- **Low Specificity (54.7%):** The high number of false positives may cause unnecessary anxiety and healthcare costs
- **Dataset Size:** Limited training data (303 patients) may affect generalizability
- **Feature Engineering:** The model relies on 7 engineered features that may not generalize to other datasets

### Recommendations

- Use only as a preliminary screening tool, never for final diagnosis
- Always combine with clinical judgment and additional testing
- Consider the high false positive rate when interpreting results
- Validate performance on diverse populations before broader deployment
- Implement appropriate patient counseling protocols for positive predictions
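
To make the point about false positives concrete, the sketch below applies Bayes' rule to the reported sensitivity (0.8503) and specificity (0.5467) to estimate how often a positive flag is correct at a given disease prevalence. The function name and the prevalence values are illustrative assumptions, not part of the released model.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a flagged patient truly has the disease (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Values reported in this model card
SENSITIVITY = 0.8503
SPECIFICITY = 0.5467

# Hypothetical prevalence values, chosen only for illustration
for prevalence in (0.05, 0.10, 0.50):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

At a prevalence near 50% the result is close to the reported precision of 0.652. In a lower-prevalence screening population, the same sensitivity and specificity give a much lower share of correct positive flags.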

## How to Get Started with the Model

```python
import joblib
import pandas as pd

# Load the serialized bundle: model, scaler, feature order, and decision threshold
model_data = joblib.load('models/heart_risk_ensemble_v3.pkl')
ensemble_model = model_data['ensemble_model']
scaler = model_data['scaler']
feature_names = model_data['feature_names']
optimal_threshold = model_data['optimal_threshold']

# Example patient data: 13 clinical features plus 7 additional inputs
# (set to 0 in this example)
patient_data = {
    'age': 65, 'sex': 1, 'cp': 1, 'trestbps': 150, 'chol': 280,
    'fbs': 1, 'restecg': 1, 'thalach': 120, 'exang': 1, 'oldpeak': 3.0,
    'slope': 1, 'ca': 2, 'thal': 7,
    'feature_14': 0, 'feature_15': 0, 'feature_16': 0,
    'feature_17': 0, 'feature_18': 0, 'feature_19': 0, 'feature_20': 0
}

# Order the columns as the model expects, scale, and predict
patient_df = pd.DataFrame([patient_data], columns=feature_names)
patient_scaled = scaler.transform(patient_df)
probability = ensemble_model.predict_proba(patient_scaled)[0, 1]

# Apply the tuned threshold instead of the default 0.5 cutoff
prediction = int(probability >= optimal_threshold)

print(f"Risk Probability: {probability:.2f}")
print(f"High Risk: {'Yes' if prediction else 'No'}")
```

## Training Details

### Training Data

- **Source:** UCI Heart Disease Dataset from Cleveland Clinic Foundation
- **Size:** 303 patient records
- **Features:** 13 original clinical features + 7 engineered features
- **Target:** Binary classification (presence/absence of heart disease)
- **Preprocessing:** SMOTE (Synthetic Minority Over-sampling Technique) applied to handle class imbalance

### Training Procedure

#### Preprocessing

1. Feature scaling using StandardScaler
2. SMOTE oversampling for class balance
3. Feature engineering to create 7 additional predictive features
4. Train/test split: 80/20
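
The list above does not state the order in which these steps were run. A common way to avoid leaking test information is to split first and fit the scaling statistics on the training portion only. The sketch below shows that idea in plain Python for a single feature; the values and helper names are made up for illustration and are not taken from the project, which lists scikit-learn's `StandardScaler` for this step.

```python
import math


def fit_standardizer(train_values):
    """Return (mean, std) computed from the training values only."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = math.sqrt(variance)  # population std (ddof=0), as StandardScaler uses
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]


# Hypothetical resting blood pressure values, already split into train and test
train = [120.0, 130.0, 140.0, 150.0]
test = [160.0]

mean, std = fit_standardizer(train)          # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test reuses the train statistics
```

The same reasoning applies to oversampling: synthetic samples are normally generated from the training split only, so the held-out set keeps its original class distribution.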

#### Training Hyperparameters

- **Training regime:** Standard precision (fp32)
- **Ensemble method:** Voting classifier with soft voting
- **Cross-validation:** 5-fold stratified cross-validation
- **Optimization target:** Maximized sensitivity (recall)
- **Threshold optimization:** Custom threshold tuning for high sensitivity
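
Threshold tuning for sensitivity can be sketched as a search for the highest cutoff that still reaches a target recall on validation data. The target value, labels, and probabilities below are invented for illustration, and `pick_threshold` is a hypothetical helper; the card does not describe the actual tuning procedure.

```python
def recall_at(y_true, probs, threshold):
    """Fraction of actual positives whose probability is >= threshold."""
    positives = [p for y, p in zip(y_true, probs) if y == 1]
    return sum(p >= threshold for p in positives) / len(positives)


def pick_threshold(y_true, probs, target_recall):
    """Highest observed probability that, used as a cutoff, meets the target recall."""
    for threshold in sorted(set(probs), reverse=True):
        if recall_at(y_true, probs, threshold) >= target_recall:
            return threshold
    return 0.0  # fallback: flag everyone


# Hypothetical validation labels and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.35, 0.6, 0.3, 0.2, 0.1]

threshold = pick_threshold(y_true, probs, target_recall=0.75)
print(f"chosen threshold: {threshold}")
```

Lowering the cutoff raises recall but also flags more healthy patients, which is the trade-off reflected in the reported specificity.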

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- **Size:** 20% holdout set from UCI Heart Disease Dataset
- **Evaluation method:** Stratified sampling to maintain class distribution

#### Metrics

Performance is optimized for medical screening scenarios where high sensitivity is critical:

- **ROC-AUC:** Area under the receiver operating characteristic curve
- **Sensitivity (Recall):** True positive rate - primary optimization target
- **Specificity:** True negative rate
- **Precision:** Positive predictive value
- **F1-Score:** Harmonic mean of precision and recall

### Results

| Metric | Value | Interpretation |
|--------|-------|----------------|
| **AUC-ROC** | 0.7986 | Good discriminative ability |
| **Accuracy** | 0.6984 | Overall correct predictions |
| **Sensitivity (Recall)** | 0.8503 ⭐ | Excellent - catches 85% of actual cases |
| **Specificity** | 0.5467 | Moderate - 45% false positive rate |
| **Precision** | 0.6520 | When the model says "high risk", it is correct 65% of the time |
| **F1-Score** | 0.7381 | Balanced performance measure |
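
The metrics in the table are linked by standard formulas. As a consistency check, the sketch below recomputes the F1-score and the false positive rate from the reported precision, recall, and specificity; only those three input values come from the table, and the function name is illustrative.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the results table
precision = 0.6520
recall = 0.8503
specificity = 0.5467

f1 = f1_score_from(precision, recall)
false_positive_rate = 1 - specificity

print(f"F1  = {f1:.4f}")
print(f"FPR = {false_positive_rate:.4f}")
```

Both values agree with the table: an F1-score of 0.7381 and a false positive rate of about 45.3%.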

#### Summary

The model achieves its primary objective of high sensitivity (85.03%), making it effective for screening applications where missing a positive case is more costly than raising a false alarm. The trade-off is a high false positive rate (45.3%), which requires careful implementation with appropriate follow-up protocols.

## Model Examination

The ensemble approach supports interpretability through:

- Feature importance rankings from tree-based models
- Coefficient analysis from logistic regression
- Individual model predictions for transparency
- SHAP values, which can be computed for individual predictions

## Environmental Impact

- **Hardware Type:** Standard CPU (no GPU required)
- **Training Time:** < 1 hour on standard hardware
- **Model Size:** < 10 MB
- **Inference:** Real-time prediction capability
- **Carbon Footprint:** Minimal due to efficient algorithms and small dataset

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Ensemble of 4 base classifiers with soft voting
- **Objective:** Binary classification optimized for maximum sensitivity
- **Input:** 20 numerical features (13 clinical + 7 engineered)
- **Output:** Risk probability + binary classification with optimized threshold
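
Soft voting averages the positive-class probabilities of the base classifiers and then applies the decision threshold. The sketch below shows that arithmetic in plain Python with made-up probabilities and a made-up threshold. It assumes equal weights, which the card does not confirm, and it is not the project's code; scikit-learn offers the same averaging through `VotingClassifier(voting="soft")`.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model positive-class probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)  # equal weights (assumption)
    total = sum(w * p for w, p in zip(weights, probabilities))
    return total / sum(weights)


# Hypothetical outputs of the four base models for one patient
model_probs = {
    "random_forest": 0.62,
    "xgboost": 0.70,
    "gradient_boosting": 0.58,
    "logistic_regression": 0.50,
}
threshold = 0.35  # hypothetical tuned threshold

risk = soft_vote(list(model_probs.values()))
high_risk = risk >= threshold

print(f"Risk probability: {risk:.2f}, high risk: {high_risk}")
```

Because the output is an averaged probability rather than a majority label, the threshold can be moved freely to trade specificity for sensitivity.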

### Compute Infrastructure

#### Hardware

- Standard CPU sufficient for training and inference
- Memory requirements: < 1 GB RAM
- No specialized hardware needed

#### Software

- Python 3.7+
- scikit-learn
- XGBoost
- pandas, numpy
- joblib for model serialization

## Citation

**BibTeX:**

```bibtex
@misc{heart_risk_ai_v3,
  title={Heart Risk AI: Cardiovascular Risk Prediction Model v3},
  author={Juan Manuel Infante Quiroga},
  year={2024},
  note={Ensemble model optimized for high sensitivity in cardiovascular screening}
}
```

## Glossary

- **Sensitivity (Recall):** Proportion of actual positive cases correctly identified
- **Specificity:** Proportion of actual negative cases correctly identified
- **SMOTE:** Synthetic Minority Over-sampling Technique for handling imbalanced data
- **Ensemble:** Combination of multiple models to improve prediction performance
- **ROC-AUC:** Area under the receiver operating characteristic curve; measures overall discriminative ability

## Model Card Authors

Juan Manuel Infante Quiroga

## Model Card Contact

https://www.linkedin.com/in/juaninfantequiroga/