import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Load dataset
df = pd.read_csv("ResearchInformation3.csv")
# Preprocessing
categorical_cols = ['Department', 'Gender', 'Income', 'Hometown', 'Preparation', 'Gaming',
                    'Attendance', 'Job', 'Extra', 'Semester']
le = LabelEncoder()
for col in categorical_cols:
    # Replace each category with an integer code (the encoder is refit per column)
    df[col] = le.fit_transform(df[col])
# Create performance labels based on GPA
conditions = [df['Overall'] >= 3.8, df['Overall'] >= 3.0, df['Overall'] < 3.0]
choices = ['High', 'Medium', 'Low']
df['PerformanceLabel'] = np.select(conditions, choices, default='Unknown').astype(object)
# Scale numerical features for clustering
features = df.drop(columns=['Overall', 'PerformanceLabel'])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
# Apply KMeans Clustering to group similar students
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)
# Train a Random Forest Classifier using original features + cluster labels
X = df.drop(columns=['Overall', 'PerformanceLabel'])
y = df['PerformanceLabel']
clf = RandomForestClassifier(random_state=42)
clf.fit(X, y)
# Predictions are made on the training data itself, so the metrics shown in the app are in-sample
y_pred = clf.predict(X)
# Streamlit App Configuration
st.set_page_config(layout="wide")
tabs = st.tabs(["About the App", "Dataset & Results", "Performance Checker"])
# Page 1: About the App
with tabs[0]:
    st.title("🎓 Student Performance Analysis App")
    st.markdown("""
    Welcome to the **Student Performance Analysis App**! This app combines **Unsupervised Learning (KMeans Clustering)** and **Supervised Learning (Random Forest Classification)** to analyze student academic and behavioral data. Here's how it works:

    - **Clustering**: The KMeans algorithm groups students with similar traits and behavior, which helps uncover patterns in the data.
    - **Classification**: The Random Forest classifier predicts each student's GPA category (**High**, **Medium**, or **Low**) from their features and cluster group.

    ### Key Features:

    - **Clustering** of students into distinct groups based on their behavior and performance data.
    - **Classification** of students into performance labels (High, Medium, Low GPA).
    - **Visualizations** and detailed analyses to help understand the data and model predictions.

    This tool aims to help educators, researchers, and students themselves identify trends and make data-driven decisions.
    """)
# Page 2: Dataset & Results
with tabs[1]:
    st.header("📊 Dataset Overview")
    st.markdown("""
    The dataset contains features describing students' academic performance and behavioral characteristics. Categorical variables have been label-encoded into numerical values, and each student has been assigned a cluster by KMeans.

    Below are the first few rows of the dataset after preprocessing.
    """)
    st.dataframe(df.head())
    st.subheader("🎯 Cluster Distribution")
    st.markdown("""
    The **Cluster Distribution** chart shows how the students are distributed across the clusters identified by the KMeans algorithm. Each cluster represents a group of students with similar attributes, such as academic performance and behavioral characteristics.
    """)
    fig1, ax1 = plt.subplots(figsize=(8, 6))
    sns.countplot(data=df, x='Cluster', hue='Cluster', palette='Set2', ax=ax1)
    ax1.set_title("Number of Students per Cluster")
    ax1.set_xlabel("Cluster")
    ax1.set_ylabel("Count")
    st.pyplot(fig1)
    st.subheader("📊 Cluster Feature Comparison")
    st.markdown("""
    The **Cluster Feature Comparison** chart compares the **Attendance** feature across student clusters. This visualization shows how attendance varies between clusters, which can indicate patterns in student behavior.
    """)
    fig2, ax2 = plt.subplots(figsize=(8, 6))
    sns.boxplot(x='Cluster', y='Attendance', data=df, hue='Cluster', palette='Set2', ax=ax2)
    ax2.set_title("Cluster-wise Distribution of Attendance")
    st.pyplot(fig2)
    st.subheader("📈 Performance Comparison Across Clusters")
    st.markdown("""
    The **Performance Comparison Across Clusters** chart shows the distribution of students' performance labels (High, Medium, Low GPA) within each cluster. This helps visualize how performance is spread across the clusters and identify trends in academic achievement.
    """)
    fig3, ax3 = plt.subplots(figsize=(8, 6))
    sns.countplot(data=df, x='PerformanceLabel', hue='Cluster', ax=ax3)
    ax3.set_title("Performance Distribution per Cluster")
    st.pyplot(fig3)
    st.subheader("📈 Confusion Matrix")
    st.markdown("""
    The **Confusion Matrix** compares the actual performance labels (rows) with the labels predicted by the Random Forest classifier (columns) for High, Medium, and Low GPA. Because the predictions are made on the training data, these counts overstate how well the model would do on unseen students.
    """)
    cm = confusion_matrix(y, y_pred, labels=['High', 'Medium', 'Low'])
    cm_df = pd.DataFrame(cm, index=['High', 'Medium', 'Low'], columns=['High', 'Medium', 'Low'])
    st.dataframe(cm_df)
    st.subheader("🎯 Model Performance Metrics")
    st.markdown("""
    Here are the performance metrics of the Random Forest classifier. The **classification report** gives precision, recall, and F1-score for each performance category (High, Medium, Low GPA).
    """)
    # Pass `labels` explicitly: without it, target_names are matched to the labels in
    # sorted order (High, Low, Medium), which would swap the Medium and Low rows.
    st.text(classification_report(y, y_pred, labels=['High', 'Medium', 'Low'],
                                  target_names=['High', 'Medium', 'Low']))
# Page 3: Performance Checker
with tabs[2]:
    st.header("🔍 Student Performance Checker")
    st.markdown("""
    Use this feature to check the predicted performance and assigned cluster for a specific student. Select a student by index, and the app will display their features, the predicted performance level (High, Medium, Low), and the cluster group they belong to.
    """)
    student_idx = st.slider("Select Student Index", 0, len(df)-1, 0, key="student_index_slider")
    student = df.iloc[student_idx]
    st.subheader("📌 Selected Student Features")
    st.markdown("""
    These are the features of the selected student, excluding the overall GPA and performance label. This data helps to understand the individual characteristics of the student.
    """)
    st.dataframe(student.drop(['Overall', 'PerformanceLabel']).to_frame().T)
    st.subheader("📊 Prediction Summary")
    st.markdown("""
    Based on the features of the selected student, the **Predicted Performance Level** and **Assigned Cluster** are shown below. The performance level comes from the Random Forest classifier and the cluster from KMeans.
    """)
    st.success(f"Predicted Performance Level: **{y_pred[student_idx]}**")
    st.info(f"Assigned Cluster Group: **{student['Cluster']}**")
    st.subheader("🔎 Feature Importance")
    st.markdown("""
    The **Feature Importance** chart shows the 5 features the Random Forest classifier relies on most when predicting performance. These importances describe the model as a whole, so the chart is the same for every selected student.
    """)
    feature_importances = pd.Series(clf.feature_importances_, index=X.columns)
    top_features = feature_importances.sort_values(ascending=False).head(5)
    st.bar_chart(top_features)
    st.subheader("📊 Feature Values for Selected Student")
    st.markdown("""
    This table lists the selected student's value for each feature, one feature per row, to show which attributes feed into the performance prediction.
    """)
    # One row per feature, with the student's value in a single "Value" column
    feature_comparison = student.drop(['Overall', 'PerformanceLabel']).to_frame(name="Value")
    st.dataframe(feature_comparison)
    st.subheader("📈 Cluster Visualization")
    st.markdown("""
    The **Cluster Visualization** scatter plot shows how students are distributed in the 2D space of **Attendance** vs **Income**. Each point represents a student, colored by their assigned cluster group.
    """)
    fig4, ax4 = plt.subplots(figsize=(8, 6))
    sns.scatterplot(x='Attendance', y='Income', hue='Cluster', data=df, palette='Set2', ax=ax4)
    ax4.set_title("Cluster Distribution in Attendance vs Income")
    st.pyplot(fig4)