student_performance_analysis / app.py

reysarms

updated app.py

d6e5301 4 months ago

	import streamlit as st
	import pandas as pd
	import numpy as np
	import matplotlib.pyplot as plt
	import seaborn as sns
	from sklearn.preprocessing import LabelEncoder, StandardScaler
	from sklearn.cluster import KMeans
	from sklearn.ensemble import RandomForestClassifier
	from sklearn.metrics import confusion_matrix, classification_report

	# Load dataset
	df = pd.read_csv("ResearchInformation3.csv")

	# Preprocessing
	categorical_cols = ['Department', 'Gender', 'Income', 'Hometown', 'Preparation', 'Gaming',
	                    'Attendance', 'Job', 'Extra', 'Semester']
	le = LabelEncoder()
	for col in categorical_cols:
	    df[col] = le.fit_transform(df[col])

	# Create performance labels based on GPA
	conditions = [df['Overall'] >= 3.8, df['Overall'] >= 3.0, df['Overall'] < 3.0]
	choices = ['High', 'Medium', 'Low']
	df['PerformanceLabel'] = np.select(conditions, choices, default='Unknown').astype(object)

	# Scale numerical features for clustering
	features = df.drop(columns=['Overall', 'PerformanceLabel'])
	scaler = StandardScaler()
	X_scaled = scaler.fit_transform(features)

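	# Apply KMeans Clustering to group similar students
	# (n_init is set explicitly so results do not depend on the scikit-learn version's default)
	kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
	df['Cluster'] = kmeans.fit_predict(X_scaled)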

	# Train a Random Forest Classifier using original features + cluster labels
	X = df.drop(columns=['Overall', 'PerformanceLabel'])
	y = df['PerformanceLabel']
	clf = RandomForestClassifier(random_state=42)
	clf.fit(X, y)
	# Note: these are predictions on the training rows, not on held-out data
	y_pred = clf.predict(X)

	# Streamlit App Configuration
	st.set_page_config(layout="wide")
	tabs = st.tabs(["About the App", "Dataset & Results", "Performance Checker"])

	# Page 1: About the App
	with tabs[0]:
	    st.title("🎓 Student Performance Analysis App")
	    st.markdown("""
	    Welcome to the Student Performance Analysis App! This app uses a combination of Unsupervised Learning (KMeans Clustering) and Supervised Learning (Random Forest Classification) to analyze student academic and behavioral data. Here's how it works:

	    - Clustering: The KMeans algorithm groups students based on their traits and behavior, identifying clusters of students with similar characteristics. This helps us uncover patterns within the data.
	    - Classification: The Random Forest classifier predicts each student's GPA category (High, Medium, or Low) from their features and cluster group.

	    ### Key Features:
	    - Clustering of students into distinct groups based on their behavior and performance data.
	    - Classification of students into performance labels (High, Medium, Low GPA).
	    - Visualizations and detailed analyses to help understand the data and model predictions.

	    This tool aims to help educators, researchers, and students themselves identify trends and make data-driven decisions.
	    """)

	# Page 2: Dataset & Results
	with tabs[1]:
	    st.header("📊 Dataset Overview")
	    st.markdown("""
	    The dataset contains various features related to students' academic performance and behavioral characteristics. The features have been preprocessed, including encoding of categorical variables and clustering of students using KMeans.

	    Below are the first few rows of the dataset after preprocessing, where categorical variables have been converted into numerical values, and clusters have been assigned to each student based on their behavior and academic performance.
	    """)
	    st.dataframe(df.head())

	    st.subheader("🎯 Cluster Distribution")
	    st.markdown("""
	    The Cluster Distribution chart shows how the students are distributed across the clusters identified by the KMeans algorithm. Each cluster represents a group of students with similar attributes, such as academic performance and behavioral characteristics.
	    """)
	    fig1, ax1 = plt.subplots(figsize=(8, 6))
	    sns.countplot(data=df, x='Cluster', hue='Cluster', palette='Set2', ax=ax1)
	    ax1.set_title("Number of Students per Cluster")
	    ax1.set_xlabel("Cluster")
	    ax1.set_ylabel("Count")
	    st.pyplot(fig1)

	    st.subheader("📊 Cluster Feature Comparison")
	    st.markdown("""
	    The Cluster Feature Comparison chart compares the Attendance feature across student clusters. This visualization shows how attendance varies between clusters, which can indicate patterns in student behavior.
	    """)
	    fig2, ax2 = plt.subplots(figsize=(8, 6))
	    sns.boxplot(x='Cluster', y='Attendance', data=df, hue='Cluster', palette='Set2', ax=ax2)
	    ax2.set_title("Cluster-wise Distribution of Attendance")
	    st.pyplot(fig2)

	    st.subheader("📈 Performance Comparison Across Clusters")
	    st.markdown("""
	    The Performance Comparison Across Clusters chart shows the distribution of students' performance labels (High, Medium, Low GPA) within each cluster. This helps to visualize how performance is spread across the clusters and identify trends in academic achievement.
	    """)
	    fig3, ax3 = plt.subplots(figsize=(8, 6))
	    sns.countplot(data=df, x='PerformanceLabel', hue='Cluster', ax=ax3)
	    ax3.set_title("Performance Distribution per Cluster")
	    st.pyplot(fig3)

	    st.subheader("📈 Confusion Matrix")
	    st.markdown("""
	    The Confusion Matrix compares the actual performance labels (rows) with the labels predicted by the Random Forest classifier (columns) for the High, Medium, and Low GPA categories. Because the predictions are made on the same data the model was trained on, these counts likely overstate how well the model would do on new students.
	    """)
	    cm = confusion_matrix(y, y_pred, labels=['High', 'Medium', 'Low'])
	    cm_df = pd.DataFrame(cm, index=['High', 'Medium', 'Low'], columns=['High', 'Medium', 'Low'])
	    st.dataframe(cm_df)

	    st.subheader("🎯 Model Performance Metrics")
	    st.markdown("""
	    Here are the performance metrics of the Random Forest classifier. The classification report provides precision, recall, and F1-score for each performance category (High, Medium, Low GPA).
	    """)
	    # Pass labels explicitly: without them, target_names would be applied to the
	    # alphabetically sorted classes (High, Low, Medium) and mislabel two rows
	    st.text(classification_report(y, y_pred, labels=['High', 'Medium', 'Low'],
	                                  target_names=['High', 'Medium', 'Low']))

	# Page 3: Performance Checker
	with tabs[2]:
	    st.header("🔍 Student Performance Checker")
	    st.markdown("""
	    Use this feature to check the predicted performance and assigned cluster for a specific student. You can select a student by their index, and the app will display their features, the predicted performance level (High, Medium, Low), and the cluster group they belong to.
	    """)

	    student_idx = st.slider("Select Student Index", 0, len(df)-1, 0, key="student_index_slider")
	    student = df.iloc[student_idx]

	    st.subheader("📌 Selected Student Features")
	    st.markdown("""
	    These are the features of the selected student, excluding the overall GPA and performance label. This data helps to understand the individual characteristics of the student.
	    """)
	    st.dataframe(student.drop(['Overall', 'PerformanceLabel']).to_frame().T)

	    st.subheader("📊 Prediction Summary")
	    st.markdown("""
	    Based on the features of the selected student, the Predicted Performance Level and Assigned Cluster are shown below. The performance level comes from the Random Forest classifier and the cluster from KMeans clustering.
	    """)
	    st.success(f"Predicted Performance Level: {y_pred[student_idx]}")
	    st.info(f"Assigned Cluster Group: {student['Cluster']}")

	    st.subheader("🔎 Feature Importance")
	    st.markdown("""
	    The Feature Importance chart shows the 5 features the Random Forest classifier relies on most across all students. These importances describe the model as a whole and do not change with the selected student.
	    """)
	    feature_importances = pd.Series(clf.feature_importances_, index=X.columns)
	    top_features = feature_importances.sort_values(ascending=False).head(5)
	    st.bar_chart(top_features)

	    st.subheader("📊 Feature Comparison for Selected Student")
	    st.markdown("""
	    The Feature Comparison table lists the value of each feature for the selected student, one feature per row, so the inputs behind the prediction are easier to read.
	    """)
	    # One row per feature, with a single "Value" column
	    feature_comparison = student.drop(['Overall', 'PerformanceLabel']).to_frame(name="Value")
	    st.dataframe(feature_comparison)

	    st.subheader("📈 Cluster Visualization")
	    st.markdown("""
	    The Cluster Visualization scatter plot shows how students are distributed in the 2D space of Attendance vs Income. Each point represents a student, colored by their assigned cluster group.
	    """)
	    fig4, ax4 = plt.subplots(figsize=(8, 6))
	    sns.scatterplot(x='Attendance', y='Income', hue='Cluster', data=df, palette='Set2', ax=ax4)
	    ax4.set_title("Cluster Distribution in Attendance vs Income")
	    st.pyplot(fig4)
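
The performance-labelling step near the top of the script depends on `np.select` returning the choice for the first condition that matches. That is why the `>= 3.0` condition only ever labels GPAs below 3.8, and why a missing GPA, which matches no condition, falls through to `'Unknown'`. The sketch below restates that rule in plain Python so the thresholds can be checked by hand. The function name `performance_label` is hypothetical and is not part of app.py.

```python
import math


def performance_label(gpa):
    """Return 'High', 'Medium', 'Low', or 'Unknown' for a GPA value.

    Hypothetical helper mirroring the np.select rule in app.py:
    conditions are checked in order and the first match wins.
    """
    # A missing GPA matches no threshold, like the np.select default
    if gpa is None or math.isnan(gpa):
        return "Unknown"
    if gpa >= 3.8:
        return "High"
    if gpa >= 3.0:  # only reached when gpa < 3.8
        return "Medium"
    return "Low"


# Print the label for a few boundary values
for value in (3.9, 3.8, 3.5, 3.0, 2.4, float("nan")):
    print(value, performance_label(value))
```

Checking the boundary values 3.8 and 3.0 this way is a quick test of whether the cut-offs behave as intended before the labels are used to train the classifier.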