Exploratory Data Analyis of MasteryConnect Student Test Data

Exploratory Data Analyis of MasteryConnect Student Test Data#

Student Data analysis

Problem Statement#

Teachers make on average, around 1500 decisions per day link This may seem like a dramatic statistic to some people outside the profession, but it is an enduring reality for those teachers who give their all in pursuit of positive academic outcomes. Tracking and monitoring student performance is crucial to acheiving this, however, manually doing so throughout the year, is not only challenging, but quite difficult to sustain. This project aims to alleviate this burden by:

Collecting CSV assessment data from Mastery Connect
Processing and uploading datasets to MongoDB
Generating targeted insights using MongoDB’s aggregation pipelines and pandas (a Python data manipulation library)
Presenting these insights through intuitive visualizations

The key performance indicators generated by this system, such as standard mastery rates, question type proficiency, and individual student growth trajectories are invaluable insights that will empower teachers to implement data-driven classroom strategies, help students understand their progress, and allow administrators to identify broader trends.

Designed with scalability in mind, the system can potentially extend to multiple classrooms or entire school districts, significantly reducing decision fatigue for educators.

Analysis Topic	Question	Approach
Overall Performance Analysis	What is the overall distribution of student performance across all tests?	Analyze descriptive statistics of overall scores and visualize the distribution.
Standard-Based Performance Analysis	How do students perform across different biology standards?	a) Calculate and visualize average performance for each standard. b) Identify top and bottom performing students for each standard. c) Rank standards by difficulty based on average performance.
Question Type Analysis	How does performance vary across different question types?	Calculate and visualize difficulty rankings for different question types.
Performance Analysis Over Time	How has student performance changed over time?	Analyze and visualize performance trends over time for all students.
Test Participation Analysis	Are there patterns in test participation or missed tests?	Identify and analyze data for students who did not complete certain tests.
Depth of Knowledge (DOK) Analysis	How does performance correlate with the depth of knowledge required?	Analyze performance across different DOK levels.
Class and Teacher Impact Analysis	Are there significant performance differences across classes or teachers?	Compare performance metrics across different classes and teachers.
Standard Coverage Analysis	Which standards are most frequently tested, and how does this relate to performance?	Analyze the frequency of questions for each standard and correlate with performance data.
Individual Student Progress Tracking	How can we visualize and analyze individual student progress over time?	Develop interactive visualizations for tracking individual student performance across tests and standards.
Test Design Analysis	Are there patterns in test composition that correlate with overall performance?	Analyze the relationship between test characteristics (e.g., question types, standards covered) and overall performance.

Imports#

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from dotenv import load_dotenv
from IPython.display import display

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from src.components.db_connection import get_database, build_query 
from src.pipelines import mongo_db_pipelines as m_db
from src.components.data_ingestion import drop_collection

Initialize database#

db = get_database(create_indices=True)
students = db['students']
tests = db['tests']

[ 2024-08-03 08:48:18,173 ] 29 my_logger - INFO - Successfully connected to MongoDB

Filter data for particular teacher(s) or time period#

load_dotenv()
teacher = os.getenv('TEACHER_2')
class_name = ''
start_date = ''

query = build_query(teacher=teacher, class_name=class_name, start_date=start_date)

Statistical analysis#

numeric data across all classes#

from src.utils import describe_numeric_field_pandas
# Average scores on tests
overall_scores_stats = describe_numeric_field_pandas(students, 'overall_percentage', query)


display(pd.DataFrame(overall_scores_stats))

	0
count	7729.000000
mean	63.818735
std	22.409500
min	0.000000
25%	47.000000
50%	67.000000
75%	80.000000
max	100.000000
skewness	-0.287499
kurtosis	-0.727309

Frequency distribution across all classes#

This shows the distribution of fields (count values) by collection

from src.utils import get_frequency_distribution

# MongoDB pipeline for frequency distribution
standard_dist = get_frequency_distribution(db=db, field='questions', subfield='standard', teacher=teacher)
item_type_dist = get_frequency_distribution(db=db, field='questions', subfield='item_type_name', teacher=teacher)
dok_dist = get_frequency_distribution(db=db, field='questions', subfield='dok', teacher=teacher)

# Renaming primary key 
standard_dist = pd.DataFrame(standard_dist).rename(columns={'_id':'standards'})
item_type_dist = pd.DataFrame(item_type_dist).rename(columns={'_id':'item_type'})
dok_dist = pd.DataFrame(dok_dist).rename(columns={'_id':'dok'})

print("Frequency of Course Standards")
display(pd.DataFrame(standard_dist).T)
print("Distribution of Question types")
display(pd.DataFrame(item_type_dist))
print("Depth of knowledge Frequency")
display(pd.DataFrame(dok_dist))

Frequency of Course Standards

	0	1	2	3	4	5	6	7	8	9	...	94	95	96	97	98	99	100	101	102	103
standards	Bio.4.1.1	Bio.3.2.2	Bio.1.1.1	Bio.4.2.1	Bio.1.1.2	Chm.1.1.1	Bio.4.1.3	Bio.1.2.1	Bio.1.2.2	Bio.3.2.1	...	EEn.2.3.2	PSc.3.3.5	EEn.2.2.2	PSc.3.3.1	EEn.2.4.2	Chm.1.2.2	PSc.2.3.2	EEn.2.1.2	Chm.1.2.3	Chm.3.2.2
count	384	363	308	304	295	281	275	239	205	202	...	6	6	4	4	4	4	2	2	1	1

2 rows × 104 columns

Distribution of Question types

	item_type	count
0	Multiple Choice	3494
1	Multiple Choice Question	2061
2	Multiple Choice (Single Response)	318
3	Match List	122
4	Label Image with Drag and Drop	86
5	Choice Matrix	20
6	Cloze Association	16
7	Multi Select	15
8	Order List	15
9	Highlight	12
10	Sort List	3
11	Cloze Dropdown	2
12	Classification	2

Depth of knowledge Frequency

	dok	count
0	NaN	4978
1	2	626
2	3	303
3	1	251
4	4	8

Distribution Visualizations#

from src.utils import create_frequency_distributions_plot
fig_distributions = create_frequency_distributions_plot(standard_dist, item_type_dist, dok_dist)
display(fig_distributions)

Performance over time for all students#

Aggreation of test results by date, not including students who missed a test (NaN values filtered out)

from src.utils import performance_trend

# Run the performance trend analysis
performance_trend_df = pd.DataFrame(performance_trend(students, query))
print('First 5 tests by date and average')
display(performance_trend_df.head(5))

print('Last 5 tests by date and average')
display(performance_trend_df.tail(5))

First 5 tests by date and average

	date	count	avg_performance	min_value	max_value
0	2016-09-29	80	62.575000	25.0	94.0
1	2016-10-04	74	33.918919	5.0	65.0
2	2016-10-18	80	69.200000	17.0	92.0
3	2016-10-27	84	73.642857	33.0	100.0
4	2016-11-14	72	37.861111	7.0	78.0

Last 5 tests by date and average

	date	count	avg_performance	min_value	max_value
142	2024-05-14	16	61.625000	27.0	72.0
143	2024-05-15	12	81.166667	73.0	90.0
144	2024-05-17	16	73.375000	48.0	88.0
145	2024-05-23	25	55.720000	22.0	85.0
146	2024-05-24	14	62.357143	35.0	78.0

Vizualization for performance trend overtime#

from src.utils import create_performance_trend_plot
fig_performance = create_performance_trend_plot(performance_trend_df)
display(fig_performance)

Students who did not complete a test#

from src.utils import identify_students_with_nan, check_responses_for_nan

missed_test = identify_students_with_nan(students, 'overall_percentage')

# convert to data frame
missed_test_df = pd.DataFrame(missed_test)

Anonymizing Data#

import random
from faker import Faker
fake = Faker()

# Custom function to generate fake test IDs
def fake_test_id():
    class_names = ['Biology', 'Chemistry', 'Anatomy']
    test_names = ['Midterm', 'Final', 'Quiz', 'Project', 'Exam']
    
    class_name = random.choice(class_names)
    test_name = random.choice(test_names)
    fake_date = fake.date_between(start_date='-1y', end_date='today').strftime('%Y%m%d')
    
    return f"{class_name}_{test_name}_{fake_date}"

# Create mappings for each column
def create_mapping(original_values, fake_function):
    return {value: fake_function() for value in set(original_values)}

student_id_mapping = create_mapping(missed_test_df['student_id'], lambda: f"S-{fake.numerify('######')}")
first_name_mapping = create_mapping(missed_test_df['first_name'], fake.first_name)
last_name_mapping = create_mapping(missed_test_df['last_name'], fake.last_name)
teacher_mapping = create_mapping(missed_test_df['teacher'], fake.name)
test_id_mapping = create_mapping(missed_test_df['test_id'], fake_test_id)

# Apply mappings to create anonymized dataframe
anonymized_df = missed_test_df.copy()
anonymized_df['student_id'] = anonymized_df['student_id'].map(student_id_mapping)
anonymized_df['first_name'] = anonymized_df['first_name'].map(first_name_mapping)
anonymized_df['last_name'] = anonymized_df['last_name'].map(last_name_mapping)
anonymized_df['teacher'] = anonymized_df['teacher'].map(teacher_mapping)
anonymized_df['test_id'] = anonymized_df['test_id'].map(test_id_mapping)

# Display the result
display(anonymized_df)

	student_id	first_name	last_name	teacher	test_id	date
0	S-234271	Megan	Barber	Mrs. Jennifer Atkinson	Anatomy_Final_20231219	2023-12-04
1	S-234271	Megan	Barber	Mrs. Jennifer Atkinson	Biology_Midterm_20231023	2023-12-18
2	S-234271	Megan	Barber	Mrs. Jennifer Atkinson	Biology_Quiz_20240106	2023-10-30
3	S-234271	Megan	Barber	Mrs. Jennifer Atkinson	Chemistry_Quiz_20231113	2023-10-20
4	S-234271	Megan	Barber	Mrs. Jennifer Atkinson	Anatomy_Exam_20230817	2023-09-29
...	...	...	...	...	...	...
11175	S-981056	Kayla	Lawson	Juan Brooks	Biology_Exam_20230923	2016-10-04
11176	S-981056	Kayla	Lawson	Juan Brooks	Anatomy_Project_20231219	2016-10-18
11177	S-981056	Kayla	Lawson	Juan Brooks	Chemistry_Quiz_20231113	2016-09-29
11178	S-981056	Kayla	Lawson	Juan Brooks	Biology_Project_20240512	2016-11-14
11179	S-981056	Kayla	Lawson	Juan Brooks	Chemistry_Project_20231231	2016-10-27

11180 rows × 6 columns

Test Overview Interactive table#

Which tests did students miss?

from src.utils import display_interactive_student_table

# display_interactive_student_table(anonymized_df, width='80%', height= '400px')

Addirional test stats#

What tests did each student miss?

Overall Performance by Standard:#

	Question: How has each student performed per standard?

# Execute the pipeline
pipeline = m_db.comprehensive_test_analysis_pipeline(query)
results = list(students.aggregate(pipeline))

# Convert to DataFrame
df = pd.DataFrame(results)

Anonymizing Dataframe to protect sensitive data.#

import hashlib

def anonymize_df(df):
    # Create a copy of the dataframe
    anon_df = df.copy()
    
    # Create mappings for names and IDs
    student_id_map = {id: hashlib.md5(str(id).encode()).hexdigest()[:8] for id in df['student_id'].unique()}
    first_name_map = {name: fake.first_name() for name in df['first_name'].unique()}
    last_name_map = {name: fake.last_name() for name in df['last_name'].unique()}
    class_name_map = {class_name: f"Fake-class{fake.numerify('####')}" for class_name in df['class_name'].unique()}
    
    # Create a mapping for test_ids to ensure consistency
    test_id_map = {}
    for test_id in df['test_id'].unique():
        parts = test_id.split('_')
        if len(parts) >= 3:
            new_test_id = f"{class_name_map.get(parts[0], parts[0])}_{fake.word().capitalize()}_{fake.date_this_year().strftime('%Y%m%d')}"
        else:
            new_test_id = f"Test_{fake.word().capitalize()}_{fake.date_this_year().strftime('%Y%m%d')}"
        test_id_map[test_id] = new_test_id
    
    # Apply mappings
    anon_df['student_id'] = anon_df['student_id'].map(student_id_map)
    anon_df['first_name'] = anon_df['first_name'].map(first_name_map)
    anon_df['last_name'] = anon_df['last_name'].map(last_name_map)
    anon_df['class_name'] = anon_df['class_name'].map(class_name_map)
    anon_df['test_id'] = anon_df['test_id'].map(test_id_map)
    
    # Update test_name based on the new test_id
    anon_df['test_name'] = anon_df['test_id'].apply(lambda x: ' '.join(x.split('_')[1:-1]))
    
    
    return anon_df

# Apply the anonymization
anonymized_df2 = anonymize_df(df)

# Display the result
display(anonymized_df2.head(5))

	_id	student_id	first_name	last_name	test_id	test_name	class_name	question_id	question_type	student_response	correct_answer	standard	is_correct
0	6696bef5d6f16ed5d78693c9	e6a94901	Gwendolyn	Harvey	Fake-class4559_Gas_20240110	Gas	Fake-class4559	Q1	Multiple Choice Question	B	B	Bio.1.1.1	True
1	6696bef5d6f16ed5d78693c9	e6a94901	Gwendolyn	Harvey	Fake-class4559_Gas_20240110	Gas	Fake-class4559	Q10	Multiple Choice Question	A	A	Bio.3.2.1	True
2	6696bef5d6f16ed5d78693c9	e6a94901	Gwendolyn	Harvey	Fake-class4559_Gas_20240110	Gas	Fake-class4559	Q11	Multiple Choice Question	A	D	Bio.3.2.2	False
3	6696bef5d6f16ed5d78693c9	e6a94901	Gwendolyn	Harvey	Fake-class4559_Gas_20240110	Gas	Fake-class4559	Q12	Multiple Choice Question	C	B	Bio.3.3.2	False
4	6696bef5d6f16ed5d78693c9	e6a94901	Gwendolyn	Harvey	Fake-class4559_Gas_20240110	Gas	Fake-class4559	Q13	Multiple Choice Question	A	A	Bio.3.3.3	True

# General test analysis data (note: anonymized df is being used instead of df)
performance_by_standard = anonymized_df2.groupby(['student_id', 'first_name', 'last_name', 'standard', 'class_name']).agg({
    'is_correct': ['mean', 'count']
}).reset_index()
performance_by_standard.columns = ['student_id', 'first_name', 'last_name', 'standard', 'class_name', 'avg_performance', 'total_questions']
performance_by_standard['avg_performance'] *= 100  # Convert to percentage
performance_by_standard['avg_performance'] = performance_by_standard['avg_performance'].round(2) # Round off averages

display(performance_by_standard.head(5))

	student_id	first_name	last_name	standard	class_name	avg_performance	total_questions
0	00572bac	Laura	Spencer	Chm.1.1.1	Fake-class4054	100.0	12
1	00572bac	Laura	Spencer	Chm.1.1.3	Fake-class4054	0.0	2
2	00572bac	Laura	Spencer	Chm.1.1.4	Fake-class4054	100.0	2
3	00572bac	Laura	Spencer	Chm.1.2.1	Fake-class4054	0.0	2
4	00572bac	Laura	Spencer	Chm.1.2.4	Fake-class4054	0.0	4

Visualization of performance by standards for all students of all classes#

from src.utils import create_performance_dash_app
# create_performance_dash_app(performance_by_standard)

Overall Performance by Standard:#

	Question: Who are the top performing students for each standard?

# Using Fake data (anonymized_df2 vs df)
def get_top_n_students(anonymized_df2, n=5, upper=True):
    # Create a rank within each standard
    anonymized_df2['rank'] = anonymized_df2.groupby('standard')['avg_performance'].rank(method='first', ascending=not upper)
    
    # Filter to keep only the top/bottom n ranks
    result = anonymized_df2[anonymized_df2['rank'] <= n]
    
    # Sort the final result to maintain the desired order
    result = result.sort_values(['standard', 'avg_performance', 'total_questions'], 
                                ascending=[True, not upper, False])
    
    # Drop the temporary rank column
    result = result.drop('rank', axis=1)
    
    return result.reset_index(drop=True)

# Get top 5 students for each standard
top_5_students = get_top_n_students(performance_by_standard, n=5, upper=True)
print("Top 5 Performing Students for Each Standard (first four shown):")
display(top_5_students.head(20))

Top 5 Performing Students for Each Standard (first four shown):

	student_id	first_name	last_name	standard	class_name	avg_performance	total_questions
0	53beba4a	Michelle	White	Bio.1.1.1	Fake-class2111	100.0	39
1	94a5fc51	Alexa	Johnson	Bio.1.1.1	Fake-class2111	100.0	39
2	1bc64804	Brian	Carr	Bio.1.1.1	Fake-class3907	100.0	3
3	1f1632cf	Anne	Roberts	Bio.1.1.1	Fake-class3907	100.0	3
4	b36f948a	Gregory	Elliott	Bio.1.1.1	Fake-class3907	100.0	3
5	356940e2	Dawn	Patel	Bio.1.1.2	Fake-class2111	100.0	38
6	1f2dd468	Roberta	Smith	Bio.1.1.2	Fake-class8861	100.0	14
7	17c27f6b	Diane	Hall	Bio.1.1.2	Fake-class9903	100.0	6
8	40765158	Caroline	Jensen	Bio.1.1.2	Fake-class9903	100.0	6
9	17403e5a	Julie	Diaz	Bio.1.1.2	Fake-class9903	100.0	3
10	0b1640fc	Sarah	King	Bio.1.1.3	Fake-class1148	100.0	8
11	1848863e	Johnny	Walton	Bio.1.1.3	Fake-class7747	100.0	2
12	0b4a7ae3	Derek	Little	Bio.1.1.3	Fake-class5191	100.0	1
13	11c7c31e	Natalie	Morales	Bio.1.1.3	Fake-class5191	100.0	1
14	17403e5a	Julie	Diaz	Bio.1.1.3	Fake-class5191	100.0	1
15	17c27f6b	Diane	Hall	Bio.1.2.1	Fake-class9903	100.0	22
16	26bd8d3f	Catherine	Peterson	Bio.1.2.1	Fake-class9903	100.0	22
17	37054f91	Michelle	Thomas	Bio.1.2.1	Fake-class9903	100.0	22
18	3ae9501e	Charles	Park	Bio.1.2.1	Fake-class9903	100.0	22
19	1f2dd468	Roberta	Smith	Bio.1.2.1	Fake-class8861	100.0	13

Overall Performance by Standard:#

	Question: Who are the lowest performing students for each standard?

# Get lowest 5 students for each standard
bottom_5_students = get_top_n_students(performance_by_standard, n=5, upper=False)

print("Bottom 5 Performing Students for Each Standard (first four shown):")
display(bottom_5_students.head(20))

Bottom 5 Performing Students for Each Standard (first four shown):

	student_id	first_name	last_name	standard	class_name	total_questions
0	102354b4	Charles	Brown	Bio.1.1.1	Fake-class5191	30
1	11b2225e	Julia	Morris	Bio.1.1.1	Fake-class5191	30
2	1440a1b6	Jake	Sutton	Bio.1.1.1	Fake-class5191	30
3	1227c936	Christina	Galvan	Bio.1.1.1	Fake-class3549	20
4	01d86cf1	Michael	Perez	Bio.1.1.1	Fake-class3907	3
5	102354b4	Charles	Brown	Bio.1.1.2	Fake-class5191	25
6	0b4a7ae3	Derek	Little	Bio.1.1.2	Fake-class9903	3
7	0b9f19df	Cassandra	Brown	Bio.1.1.2	Fake-class9903	3
8	0d5c1de7	Michelle	Clark	Bio.1.1.2	Fake-class9903	3
9	102354b4	Charles	Brown	Bio.1.1.2	Fake-class9903	3
10	01d86cf1	Michael	Perez	Bio.1.1.3	Fake-class3907	4
11	030f3b3b	Kenneth	Rodriguez	Bio.1.1.3	Fake-class3907	4
12	013c5340	Garrett	Perez	Bio.1.1.3	Fake-class7747	2
13	0584887e	Kristina	Taylor	Bio.1.1.3	Fake-class5191	1
14	078a6cba	Danielle	Sherman	Bio.1.1.3	Fake-class5191	1
15	05650d90	Wayne	Perry	Bio.1.2.1	Fake-class2111	26
16	0b4a7ae3	Derek	Little	Bio.1.2.1	Fake-class9903	11
17	0b9f19df	Cassandra	Brown	Bio.1.2.1	Fake-class9903	11
18	102354b4	Charles	Brown	Bio.1.2.1	Fake-class9903	11
19	11b2225e	Julia	Morris	Bio.1.2.1	Fake-class9903	11

Overall Performance by Standard:#

	Question: Which standards are the most difficult as measured by average performance? 

Standards ranked by difficulty#

# Calculate average performance for each standard (using the anonymized df instead of df)
standard_difficulty = anonymized_df2.groupby('standard').agg({
    'is_correct': ['mean', 'count'],
    'student_id': 'nunique'
}).reset_index()

# Flatten column names
standard_difficulty.columns = ['standard', 'avg_performance', 'total_questions', 'unique_students']

# Convert average performance to percentage and round
standard_difficulty['avg_performance'] = (standard_difficulty['avg_performance'] * 100).round(2)

# Sort by average performance (ascending) to see most difficult standards first
standard_difficulty = standard_difficulty.sort_values('avg_performance', ascending=True)

# Add a difficulty rank
standard_difficulty['difficulty_rank'] = standard_difficulty['avg_performance'].rank(method='dense', ascending=True)

# Reset index again
standard_difficulty.reset_index(drop=True)

print("Standards Ranked by Difficulty (Most Difficult First):")
display(standard_difficulty)

Standards Ranked by Difficulty (Most Difficult First):

	standard	avg_performance	total_questions	unique_students	difficulty_rank
36	Chm.1.2.2	16.67	192	74	1.0
92	PSc.3.1.1	17.80	236	59	2.0
70	EEn.2.4.1	21.59	602	43	3.0
85	PSc.2.2.2	22.17	2436	59	4.0
55	Chm.3.1.3	22.40	384	74	5.0
...	...	...	...	...	...
66	EEn.2.1.1	67.26	2236	43	98.0
67	EEn.2.1.2	67.44	86	43	99.0
57	Chm.3.2.2	68.18	44	22	100.0
62	EEn.1.1.1	74.42	602	43	101.0
58	Chm.3.2.3	75.42	472	107	102.0

104 rows × 5 columns

Overall Performance by Standard:#

	Question: Which standards are students most proficient in?

#Highest scoring standards 
display(standard_difficulty.sort_values(by='difficulty_rank', ascending=False))

	standard	avg_performance	total_questions	unique_students	difficulty_rank
58	Chm.3.2.3	75.42	472	107	102.0
62	EEn.1.1.1	74.42	602	43	101.0
57	Chm.3.2.2	68.18	44	22	100.0
67	EEn.2.1.2	67.44	86	43	99.0
66	EEn.2.1.1	67.26	2236	43	98.0
...	...	...	...	...	...
55	Chm.3.1.3	22.40	384	74	5.0
85	PSc.2.2.2	22.17	2436	59	4.0
70	EEn.2.4.1	21.59	602	43	3.0
92	PSc.3.1.1	17.80	236	59	2.0
36	Chm.1.2.2	16.67	192	74	1.0

104 rows × 5 columns

Standard performance overview#

Question Type Difficulty#

        Question: What are the difficulty rankings for different question types?

What item types are in Mastery Connect?

# Calculate difficulty for each question type
question_type_difficulty = anonymized_df2.groupby('question_type').agg({
    'is_correct': ['mean', 'count'],
    'student_id': 'nunique'
}).reset_index()

# Flatten column names
question_type_difficulty.columns = ['question_type', 'avg_performance', 'total_questions', 'unique_students']

# Convert average performance to percentage and round
question_type_difficulty['avg_performance'] = (question_type_difficulty['avg_performance'] * 100).round(2)

# Sort by average performance (ascending) to see most difficult question types first
question_type_difficulty = question_type_difficulty.sort_values('avg_performance', ascending=True)

# Add a difficulty rank
question_type_difficulty['difficulty_rank'] = question_type_difficulty['avg_performance'].rank(method='dense', ascending=True)

print("Question Types Ranked by Difficulty (Most Difficult First):")
display(question_type_difficulty)

Question Types Ranked by Difficulty (Most Difficult First):

	question_type	avg_performance	total_questions	unique_students	difficulty_rank
1	Classification	0.00	32	16	1.0
0	Choice Matrix	8.05	447	130	2.0
2	Cloze Association	12.93	379	130	3.0
4	Highlight	13.01	292	130	4.0
3	Cloze Dropdown	14.71	136	81	5.0
6	Match List	14.82	2855	130	6.0
7	Multi Select	18.71	930	233	7.0
5	Label Image with Drag and Drop	19.62	2156	210	8.0
11	Order List	32.79	369	130	9.0
8	Multiple Choice	40.15	177247	709	10.0
12	Sort List	45.12	246	81	11.0
10	Multiple Choice Question	46.74	60075	442	12.0
9	Multiple Choice (Single Response)	50.96	13674	43	13.0

Question type difficulty visualization#

Student Progress Overtime:#

	Question: How has students progress changed over time

# Using anonymized df
from src.utils import display_performance_figure
# Check if we have a date column
if 'date' not in anonymized_df2.columns:
    # Extract date from test_id if necessary
    anonymized_df2['date'] = pd.to_datetime(anonymized_df2['test_id'].str.split('_').str[-1], format='%Y%m%d')

# Ensure date is in datetime format
anonymized_df2['date'] = pd.to_datetime(anonymized_df2['date'])

# display_performance_figure(anonymized_df2, performance_trend_df)

Exploratory Data Analyis of MasteryConnect Student Test Data

Contents

Exploratory Data Analyis of MasteryConnect Student Test Data#

Problem Statement#

Imports#

Initialize database#

Filter data for particular teacher(s) or time period#

Statistical analysis#

numeric data across all classes#

Frequency distribution across all classes#

Distribution Visualizations#

Performance over time for all students#

Vizualization for performance trend overtime#

Students who did not complete a test#

Anonymizing Data#

Test Overview Interactive table#

Addirional test stats#

Overall Performance by Standard:#

Anonymizing Dataframe to protect sensitive data.#

Visualization of performance by standards for all students of all classes#

Overall Performance by Standard:#

Overall Performance by Standard:#

Overall Performance by Standard:#

Standards ranked by difficulty#

Overall Performance by Standard:#

Standard performance overview#

Question Type Difficulty#

Question type difficulty visualization#

Student Progress Overtime:#