Juan12Dev committed
Commit 1fdf615 (verified)
1 Parent(s): 9c3bb8b

Update README.md

  1. README.md +252 -0
README.md CHANGED
@@ -1,3 +1,255 @@
1
  ---
2
  license: apache-2.0
3
+ tags:
4
+ - sklearn
5
+ - ensemble
6
+ - heart-disease
7
+ - cardiovascular-risk
8
+ - imbalanced-data
9
+ - explainable-ai
10
+ metrics:
11
+ - roc_auc
12
+ - accuracy
13
+ - recall
14
+ - precision
15
+ - f1
16
+ language:
17
+ - en
18
  ---
19
+
20
+ # Model Card for Heart Risk AI - Cardiovascular Risk Prediction v3
21
+
22
+ This ensemble model predicts the risk of cardiovascular disease. It is tuned for high sensitivity so that few true cases are missed in medical screening scenarios.
23
+
24
+ ## Model Details
25
+
26
+ ### Model Description
27
+
28
+ This model combines four machine learning algorithms (Random Forest, XGBoost, Gradient Boosting, and Logistic Regression) in a soft-voting ensemble to predict cardiovascular disease risk. It is optimized for high sensitivity (recall), which suits initial screening, where detecting as many potential cases as possible is prioritized over minimizing false positives.
29
+
30
+ - **Developed by:** Juan Manuel Infante Quiroga
31
+ - **Model type:** Ensemble Classification (Random Forest + XGBoost + Gradient Boosting + Logistic Regression)
32
+ - **Language(s):** Python (scikit-learn)
33
+ - **License:** Apache 2.0
34
+ - **Version:** v3 (optimized for high sensitivity)
35
+
36
+ ### Model Sources
37
+
38
+ - **Repository:** https://github.com/JuanInfante122/heart-risk-ai.git
39
+ - **Dataset:** UCI Heart Disease Dataset (Cleveland Clinic Foundation)
40
+ - **Framework:** scikit-learn, XGBoost
41
+
42
+ ## Uses
43
+
44
+ ### Direct Use
45
+
46
+ This model is intended for **educational and research purposes only**. It can be used as a screening tool for:
47
+
48
+ - Initial cardiovascular risk assessment in populations over 40 years of age
49
+ - Triage support in emergency services
50
+ - Risk factor monitoring for follow-up patients
51
+ - Patient education and awareness programs
52
+
53
+ **⚠️ IMPORTANT: This tool does NOT replace complete clinical evaluation, specific diagnostic tests, or professional medical judgment.**
54
+
55
+ ### Downstream Use
56
+
57
+ The model can be integrated into:
58
+ - Healthcare screening applications
59
+ - Clinical decision support systems (as a supplementary tool)
60
+ - Research studies on cardiovascular risk prediction
61
+ - Educational platforms for medical training
62
+
63
+ ### Out-of-Scope Use
64
+
65
+ - **Definitive diagnosis:** Never use for final diagnostic decisions
66
+ - **Treatment decisions:** Not suitable for determining treatment protocols
67
+ - **Standalone medical advice:** Requires professional medical interpretation
68
+ - **Real-time emergency decisions:** Not validated for acute care settings
69
+
70
+ ## Bias, Risks, and Limitations
71
+
72
+ ### Known Limitations
73
+
74
+ 1. **Population Bias:** Training data primarily from Caucasian population, limiting applicability to other ethnic groups
75
+ 2. **Missing Biomarkers:** Does not include advanced biomarkers (troponins, BNP, etc.)
76
+ 3. **Medication Status:** Does not consider patient's current medication regimen
77
+ 4. **Temporal Validation:** Requires validation in prospective cohorts
78
+ 5. **High False Positive Rate:** 45.3% of healthy patients may be incorrectly flagged as at-risk
79
+
80
+ ### Technical Risks
81
+
82
+ - **Low Specificity (54.7%):** High number of false positives may cause unnecessary anxiety and healthcare costs
83
+ - **Dataset Size:** Limited training data (303 patients) may affect generalizability
84
+ - **Feature Engineering:** Model relies on 7 engineered features that may not generalize to other datasets
85
+
86
+ ### Recommendations
87
+
88
+ - Use only as a preliminary screening tool, never for final diagnosis
89
+ - Always combine with clinical judgment and additional testing
90
+ - Consider the high false positive rate when interpreting results
91
+ - Validate performance on diverse populations before broader deployment
92
+ - Implement appropriate patient counseling protocols for positive predictions
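To make the false positive rate concrete, Bayes' rule gives the positive predictive value (the chance that a flagged patient actually has the disease) at a given prevalence. The sketch below plugs in the sensitivity (0.8503) and specificity (0.5467) reported in this card. The function name `positive_predictive_value` and the prevalence values are illustrative assumptions, not figures from the model's evaluation.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a flagged patient truly has the disease (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


SENSITIVITY = 0.8503  # reported in this model card
SPECIFICITY = 0.5467  # reported in this model card

# Hypothetical prevalence levels, chosen only for illustration
for prevalence in (0.05, 0.20, 0.50):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

At 50% prevalence the result is close to the reported precision of 0.652. At the lower prevalence typical of general-population screening, most positive flags would be false alarms, which is why follow-up testing matters.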
93
+
94
+ ## How to Get Started with the Model
95
+
96
+ ```python
+ import joblib
+ import pandas as pd
+
+ # Load the serialized bundle: model, scaler, feature order, and tuned threshold
+ model_data = joblib.load('models/heart_risk_ensemble_v3.pkl')
+ ensemble_model = model_data['ensemble_model']
+ scaler = model_data['scaler']
+ feature_names = model_data['feature_names']
+ optimal_threshold = model_data['optimal_threshold']
+
+ # Example patient data: 13 clinical features plus 7 engineered features.
+ # The keys must match the names stored in feature_names.
+ patient_data = {
+     'age': 65, 'sex': 1, 'cp': 1, 'trestbps': 150, 'chol': 280,
+     'fbs': 1, 'restecg': 1, 'thalach': 120, 'exang': 1, 'oldpeak': 3.0,
+     'slope': 1, 'ca': 2, 'thal': 7,
+     'feature_14': 0, 'feature_15': 0, 'feature_16': 0,
+     'feature_17': 0, 'feature_18': 0, 'feature_19': 0, 'feature_20': 0
+ }
+
+ # Fail early if a feature is missing; otherwise the frame below would contain NaN
+ missing = [name for name in feature_names if name not in patient_data]
+ if missing:
+     raise ValueError(f"Missing features: {missing}")
+
+ # Build a one-row frame in the training column order, scale it, and predict
+ patient_df = pd.DataFrame([patient_data], columns=feature_names)
+ patient_scaled = scaler.transform(patient_df)
+ probability = ensemble_model.predict_proba(patient_scaled)[0, 1]
+ prediction = int(probability >= optimal_threshold)
+
+ print(f"Risk Probability: {probability:.2f}")
+ print(f"High Risk: {'Yes' if prediction else 'No'}")
+ ```
125
+
126
+ ## Training Details
127
+
128
+ ### Training Data
129
+
130
+ - **Source:** UCI Heart Disease Dataset from Cleveland Clinic Foundation
131
+ - **Size:** 303 patient records
132
+ - **Features:** 13 original clinical features + 7 engineered features
133
+ - **Target:** Binary classification (presence/absence of heart disease)
134
+ - **Preprocessing:** SMOTE (Synthetic Minority Over-sampling Technique) applied to handle class imbalance
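SMOTE creates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbors. The following is a minimal pure-Python sketch of that idea, not the code used to train this model (which this card does not show). The names `smote_sample`, `k`, `n_new`, and the toy rows are illustrative assumptions; in practice a dedicated library would normally be used.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic rows by interpolating toward nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base row (excluding itself)
        others = [row for row in minority if row is not base]
        neighbors = sorted(others, key=lambda row: squared_distance(base, row))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority-class rows with two features (hypothetical values)
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sample(minority_rows, k=2, n_new=3)
print(new_rows)
```

Because each synthetic row lies on a segment between two real minority rows, it stays inside the range already covered by the minority class.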
135
+
136
+ ### Training Procedure
137
+
138
+ #### Preprocessing
139
+
140
+ 1. Feature scaling using StandardScaler
141
+ 2. SMOTE oversampling for class balance
142
+ 3. Feature engineering to create 7 additional predictive features
143
+ 4. Train/test split: 80/20
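Standardization of the kind StandardScaler performs subtracts a column's mean and divides by its population standard deviation. A common practice, shown in the sketch below, is to compute those statistics on the training portion only and reuse them for the test portion. This card does not state how the author fit the scaler, so the function names and toy cholesterol values here are assumptions for illustration.

```python
import statistics


def fit_standardizer(train_column):
    """Return (mean, std) from training values; population std, as StandardScaler uses."""
    mean = statistics.mean(train_column)
    std = statistics.pstdev(train_column)
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    return mean, std


def standardize(column, mean, std):
    """Apply previously fitted statistics to any column of values."""
    return [(value - mean) / std for value in column]


# Hypothetical cholesterol readings, already split into train and test
train_chol = [200.0, 240.0, 280.0, 320.0]
test_chol = [260.0, 300.0]

mean, std = fit_standardizer(train_chol)       # statistics come from train only
train_scaled = standardize(train_chol, mean, std)
test_scaled = standardize(test_chol, mean, std)  # test reuses the train statistics
print(train_scaled, test_scaled)
```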
144
+
145
+ #### Training Hyperparameters
146
+
147
+ - **Training regime:** Standard precision (fp32)
148
+ - **Ensemble method:** Voting classifier with soft voting
149
+ - **Cross-validation:** 5-fold stratified cross-validation
150
+ - **Optimization target:** Maximized sensitivity (recall)
151
+ - **Threshold optimization:** Custom threshold tuning for high sensitivity
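Threshold tuning for sensitivity can be sketched as follows: among candidate thresholds, keep the highest one whose recall on validation data still meets a target. This illustrates the general idea and is not necessarily the author's exact procedure; `pick_threshold`, the target recall, and the toy scores are hypothetical.

```python
def recall_at(threshold, y_true, y_prob):
    """Recall when every score >= threshold is labeled positive."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    return true_pos / positives


def pick_threshold(y_true, y_prob, target_recall=0.85):
    """Highest observed score usable as a threshold while meeting the recall target."""
    # Recall can only grow as the threshold drops, so scan from high to low
    for threshold in sorted(set(y_prob), reverse=True):
        if recall_at(threshold, y_true, y_prob) >= target_recall:
            return threshold
    return None  # no threshold meets the target (e.g. no positive labels)


# Toy validation labels and predicted probabilities (hypothetical)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1]

chosen = pick_threshold(y_true, y_prob, target_recall=0.75)
print(chosen, recall_at(chosen, y_true, y_prob))
```

Lowering the threshold this way raises sensitivity at the cost of specificity, which is the trade-off this card reports.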
152
+
153
+ ## Evaluation
154
+
155
+ ### Testing Data, Factors & Metrics
156
+
157
+ #### Testing Data
158
+
159
+ - **Size:** 20% holdout set from UCI Heart Disease Dataset
160
+ - **Evaluation method:** Stratified sampling to maintain class distribution
161
+
162
+ #### Metrics
163
+
164
+ Performance optimized for medical screening scenarios where high sensitivity is critical:
165
+
166
+ - **ROC-AUC:** Area under the receiver operating characteristic curve
167
+ - **Sensitivity (Recall):** True positive rate - primary optimization target
168
+ - **Specificity:** True negative rate
169
+ - **Precision:** Positive predictive value
170
+ - **F1-Score:** Harmonic mean of precision and recall
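All of these metrics except ROC-AUC follow directly from the four cells of a confusion matrix. The sketch below computes them from counts; `screening_metrics` is a hypothetical helper, and the counts are made-up values for illustration, not the model's actual confusion matrix.

```python
def screening_metrics(tp, fp, tn, fn):
    """Derive threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: 100 diseased and 100 healthy patients
metrics = screening_metrics(tp=85, fp=45, tn=55, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```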
171
+
172
+ ### Results
173
+
174
+ | Metric | Value | Interpretation |
+ |--------|-------|----------------|
+ | **AUC-ROC** | 0.7986 | Good discriminative ability |
+ | **Accuracy** | 0.6984 | Overall correct predictions |
+ | **Sensitivity (Recall)** | 0.8503 ⭐ | Excellent - catches 85% of actual cases |
+ | **Specificity** | 0.5467 | Moderate - 45% false positive rate |
+ | **Precision** | 0.6520 | When the model says "high risk", it is correct 65% of the time |
+ | **F1-Score** | 0.7381 | Balanced performance measure |
182
+
183
+ #### Summary
184
+
185
+ The model achieves its primary objective of high sensitivity (85.03%), making it effective for screening applications where missing a positive case is more costly than having false positives. However, the trade-off is a high false positive rate (45.3%), requiring careful implementation with appropriate follow-up protocols.
186
+
187
+ ## Model Examination
188
+
189
+ The ensemble approach provides built-in interpretability through:
190
+ - Feature importance rankings from tree-based models
191
+ - Coefficient analysis from logistic regression
192
+ - Individual model predictions for transparency
193
+ - SHAP values, which can be computed for individual predictions
194
+
195
+ ## Environmental Impact
196
+
197
+ - **Hardware Type:** Standard CPU (no GPU required)
198
+ - **Training Time:** < 1 hour on standard hardware
199
+ - **Model Size:** < 10 MB
200
+ - **Inference:** Real-time prediction capability
201
+ - **Carbon Footprint:** Minimal due to efficient algorithms and small dataset
202
+
203
+ ## Technical Specifications
204
+
205
+ ### Model Architecture and Objective
206
+
207
+ - **Architecture:** Ensemble of 4 base classifiers with soft voting
208
+ - **Objective:** Binary classification optimized for maximum sensitivity
209
+ - **Input:** 20 numerical features (13 clinical + 7 engineered)
210
+ - **Output:** Risk probability + binary classification with optimized threshold
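Soft voting averages the positive-class probabilities predicted by the base classifiers and then applies a decision threshold to the average. The sketch below shows that arithmetic in plain Python; the per-model probabilities and the threshold value are hypothetical, and `soft_vote` is an illustrative helper rather than part of this model's code.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model positive-class probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)  # equal weight per model
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, probabilities)) / total_weight


# Hypothetical outputs of the four base classifiers for one patient
model_probs = {
    "random_forest": 0.62,
    "xgboost": 0.71,
    "gradient_boosting": 0.58,
    "logistic_regression": 0.49,
}

risk = soft_vote(list(model_probs.values()))
threshold = 0.40  # hypothetical tuned threshold, not the model's stored value
high_risk = risk >= threshold
print(f"risk={risk:.2f}, high_risk={high_risk}")
```

A threshold below 0.5 flags more patients as high risk, which is how a sensitivity-oriented ensemble trades specificity for recall.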
211
+
212
+ ### Compute Infrastructure
213
+
214
+ #### Hardware
215
+
216
+ - Standard CPU sufficient for training and inference
217
+ - Memory requirements: < 1 GB RAM
218
+ - No specialized hardware needed
219
+
220
+ #### Software
221
+
222
+ - Python 3.7+
223
+ - scikit-learn
224
+ - XGBoost
225
+ - pandas, numpy
226
+ - joblib for model serialization
227
+
228
+ ## Citation
229
+
230
+ **BibTeX:**
231
+
232
+ ```bibtex
233
+ @misc{heart_risk_ai_v3,
234
+ title={Heart Risk AI: Cardiovascular Risk Prediction Model v3},
235
+ author={Juan Manuel Infante Quiroga},
236
+ year={2024},
237
+ note={Ensemble model optimized for high sensitivity in cardiovascular screening}
238
+ }
239
+ ```
240
+
241
+ ## Glossary
242
+
243
+ - **Sensitivity (Recall):** Proportion of actual positive cases correctly identified
244
+ - **Specificity:** Proportion of actual negative cases correctly identified
245
+ - **SMOTE:** Synthetic Minority Over-sampling Technique for handling imbalanced data
246
+ - **Ensemble:** Combination of multiple models to improve prediction performance
247
+ - **ROC-AUC:** Receiver Operating Characteristic - Area Under Curve, measures overall discriminative ability
248
+
249
+ ## Model Card Authors
250
+
251
+ Juan Manuel Infante Quiroga
252
+
253
+ ## Model Card Contact
254
+
255
+ https://www.linkedin.com/in/juaninfantequiroga/