Every year, high school seniors rise to one of their largest life challenges: College Admission. Each country has its system for college admission shaping the career journey of its students. Most countries use some form of college admission test. Today, I will be exploring with you Egypt’s college admission test: Sanawya Amma (ثانوية عامة).

Taking the exam is a rite of passage for any Egyptian student. Students know that their whole future depends on the exam results. Hence, preparation for it starts months before the exams, and when the exams start the country seems to stop in unity, wishing success for these students.

It is no surprise that the students give the assessment this feeling of importance. Students enter the faculty and the university of their choice based on the exam’s result. After the exam, the ministry releases the minimum grade for each faculty. If a student scores below the grade assigned to his faculty of choice, he won’t be able to fulfill his dreams. Thus, the test is given this huge importance.

Given the vitality of this test, we are going to discover its results. We’re aiming to understand what factors into exam success, and also analyze the distribution of the marks. By the end, we will hopefully find useful insights for the students. So, let’s get started!

About the dataset

In this analysis, we will use the High School (ثانوية عامة) Public Results 2022 EG from Kaggle found here. This dataset contains the exam results for the year 2022. The data was collected from public websites that released the test results.

Let’s start by viewing some instances in the dataset and the avillable feature.

import pandas as pd

df = pd.read_csv('High_School_Public_Results_2022_EG_first_attempt.csv')
df.drop('desk_no', axis=1, inplace=True)
df.head()

	school_name	administration	city	branch	Percentage	status	arabic	first_foreign_lang	second_foreign_lang	pure_mathematics	...	chemistry	biology	geology	applied_math	physics	total	religion	national_education	economics_statistics	gender
0	الاورمان الرسمية لغات بنين	الدقى	الجيزة	أدبي	87.80	ناجح	61.0	27.0	34.0	NaN	...	NaN	NaN	NaN	NaN	NaN	360.0	17.0	25.0	37.0	M
1	جمال عبد الناصرالرسمية لغات بنات	الدقى	الجيزة	علمي علوم	57.32	ناجح	47.0	25.0	26.0	NaN	...	30.0	30.0	44.0	NaN	33.0	235.0	18.0	20.0	27.0	F
2	هضبة الاهرام ث التجريبية لغات بنين	الهرم	الجيزة	أدبي	83.41	ناجح	70.0	38.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	342.0	25.0	25.0	35.0	M
3	التحرير الرسمية لغات بنين	أكتوبر	الجيزة	أدبي	53.17	ناجح	57.0	27.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	218.0	20.0	17.5	27.0	M
4	التحرير الرسمية لغات بنين	أكتوبر	الجيزة	علمي رياضة	51.46	دور ثاني	56.0	25.0	6.0	30.0	...	30.0	NaN	NaN	33.0	31.0	211.0	18.0	20.0	31.0	M

5 rows × 24 columns

df.columns

Index(['school_name', 'administration', 'city', 'branch', 'Percentage',
       'status', 'arabic', 'first_foreign_lang', 'second_foreign_lang',
       'pure_mathematics', 'history', 'geography', 'philosophy', 'psychology',
       'chemistry', 'biology', 'geology', 'applied_math', 'physics', 'total',
       'religion', 'national_education', 'economics_statistics', 'gender'],
      dtype='object')

The data set contains the following columns: - school_name: The name of the student’s school. - administration: The administration of the student’s school. An administration is a group of schools that are managed by the same entity. They are divided based on the geographical location of the schools. - city: The city where the student is from. - branch: The branch of Sanawya Amma the student is in. There are three branches: Humanities (ادبي), Science (علمي علوم), and Mathematics (علمي رياضة). - Percentage: The student’s scaled grade in the exam out of 100. This is the most important feature as it determines the student’s future. - status: The student’s status in the exam. There are three statuses: Passed, Second Chance, and Failed. More on that in the next section. - [subject] columns: The student’s grade in the subject. We will see the subjects in the next section. - total: The student’s total grade in the exam. This is the sum of the student’s grades in all subjects. It is out of 410. The percentage is calculated by dividing the total by 410 and multiplying by 100. - gender: The student’s gender. Either M or F.

Sanawya Amma Structure

Branches

Sanawya Amma has three branches: Humanities (ادبي), Science (علمي علوم), and Mathematics (علمي رياضة). Each branch has its own set of subjects. The subjects are divided into two categories: Core and Elective. The core subjects are the same for all branches, while the elective subjects are different for each branch.

Let’s understand how the students are distributed among the branches.

Subjects

The core subjects are divided to two categories: those who affect the total grade and those who don’t. Students are only required to pass the subjects that affect the total grade. To ease the analysis, we will use core subjects to refer to the subjects that affect the total grade, elective subjects to electives that affect the total grade, and pass-fail subjects to the subjects that don’t affect the total grade. Note: that this is not the official terminology.

The core subjects are: - Arabic: The Arabic language scored out of 80. - First Foreign Language: The first foreign language, English, scored out of 50. - Second Foreign Language: The second foreign language, French or German, scored out of 40.

The elective subjects for each branch are: - Humanities: - History: Scored out of 60. - Geography: Scored out of 60. - Philosophy: Scored out of 60. - Psychology: Scored out of 60. - Science: - Biology: Scored out of 60. - Chemistry: Scored out of 60. - Physics: Scored out of 60. - Geology: Scored out of 60. - Mathematics: - Pure Mathematics: Scored out of 60. - Applied Mathematics: Scored out of 60. - Physics: Scored out of 60. - Chemistry: Scored out of 60.

The pass-fail subjects are: - Religion - National Education - Economics and Statistics

Students usually don’t prepare as well for the pass-fail subjects as they do for the core and elective subjects. Therefore, we will focus on the core and elective subjects in our analysis.

Note: This represents the 2022 Sanawya Amma exam and isn’t necessarily the same for other years.

Since the subjects are scored differently, let’s visualize how each subject affects the total grade.

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

subjects = ['Arabic', 'First Foreign Language',
'Second Foreign Language', 'Pure Mathematics', 'History', 'Geography',
'Philosophy', 'Psychology', 'Chemistry', 'Biology', 'Geology',
'Applied Math', 'Physics']
max_grades = [80, 50, 40, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60]

branch_subjects_map = {
    'Humanities': ['Arabic', 'First Foreign Language', 'Second Foreign Language',
                   'History', 'Geography', 'Philosophy', 'Psychology'],
    
    'Science': ['Arabic', 'First Foreign Language', 'Second Foreign Language', 
                'Chemistry', 'Biology', 'Geology', 'Physics'],
    
    'Mathematics': ['Arabic', 'First Foreign Language', 'Second Foreign Language', 
                    'Pure Mathematics', 'Applied Math', 'Physics', 'Chemistry']
}

subjects_df = pd.DataFrame({'subject': subjects, 'max_grade': max_grades})


def plot_subjects_grade_contribution_for_branch(branch):
    branch_subjects = branch_subjects_map[branch]
    branch_subjects_df = subjects_df[subjects_df['subject'].isin(branch_subjects)].sort_values('max_grade', ascending=False)
    
    sorted_branch_subjects_df = branch_subjects_df.sort_values(by='max_grade', ascending=False)
    
    plt.figure(figsize=(8, 8))
    plt.pie(
        sorted_branch_subjects_df['max_grade'],
        labels=sorted_branch_subjects_df['subject'],
        autopct='%1.1f%%',
        startangle=140,
        colors=sns.color_palette('Blues', len(sorted_branch_subjects_df))
    )
    plt.title(f'{branch} Branch Grades Distribution')
    plt.show()

  
plot_subjects_grade_contribution_for_branch('Humanities')

plot_subjects_grade_contribution_for_branch('Science')

plot_subjects_grade_contribution_for_branch('Mathematics')

Status

The students can have one of three statuses: - Passed: The student passed the exam, scored more than 50% every subject. - Second Chance: The student failed at most three subjects and is eligible for a second chance to pass them. - Failed: The student failed more than three subjects and has to retake all the subjects next year.

During the analyisis we won’t consider the retake for Second Chance students. We will only consider the first attempt; thus, we will consider the Second Chance students as Failed students.

Cleaning the dataset

Let’s start by renaming the columns to make them more readable.

df.columns

Index(['school_name', 'administration', 'city', 'branch', 'Percentage',
       'status', 'arabic', 'first_foreign_lang', 'second_foreign_lang',
       'pure_mathematics', 'history', 'geography', 'philosophy', 'psychology',
       'chemistry', 'biology', 'geology', 'applied_math', 'physics', 'total',
       'religion', 'national_education', 'economics_statistics', 'gender'],
      dtype='object')

columns_rename_map = {
  'school_name': 'School Name',
  'administration': 'Administration',
  'city': 'City',
  'branch': 'Branch',
  'status': 'Status',
  'arabic': 'Arabic',
  'first_foreign_lang': 'First Foreign Language',
  'second_foreign_lang': 'Second Foreign Language',
  'pure_mathematics': 'Pure Mathematics',
  'history': 'History',
  'geography': 'Geography',
  'philosophy': 'Philosophy',
  'psychology': 'Psychology',
  'chemistry': 'Chemistry',
  'biology': 'Biology',
  'geology': 'Geology',
  'applied_math': 'Applied Math',
  'physics': 'Physics',
  'total': 'Total',
  'religion': 'Religion',
  'national_education': 'National Education',
  'economics_statistics': 'Economics Statistics',
  'gender': 'Gender',
}

df = df.rename(columns=columns_rename_map)
df.head()

	School Name	Administration	City	Branch	Percentage	Status	Arabic	First Foreign Language	Second Foreign Language	Pure Mathematics	...	Chemistry	Biology	Geology	Applied Math	Physics	Total	Religion	National Education	Economics Statistics	Gender
0	الاورمان الرسمية لغات بنين	الدقى	الجيزة	أدبي	87.80	ناجح	61.0	27.0	34.0	NaN	...	NaN	NaN	NaN	NaN	NaN	360.0	17.0	25.0	37.0	M
1	جمال عبد الناصرالرسمية لغات بنات	الدقى	الجيزة	علمي علوم	57.32	ناجح	47.0	25.0	26.0	NaN	...	30.0	30.0	44.0	NaN	33.0	235.0	18.0	20.0	27.0	F
2	هضبة الاهرام ث التجريبية لغات بنين	الهرم	الجيزة	أدبي	83.41	ناجح	70.0	38.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	342.0	25.0	25.0	35.0	M
3	التحرير الرسمية لغات بنين	أكتوبر	الجيزة	أدبي	53.17	ناجح	57.0	27.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	218.0	20.0	17.5	27.0	M
4	التحرير الرسمية لغات بنين	أكتوبر	الجيزة	علمي رياضة	51.46	دور ثاني	56.0	25.0	6.0	30.0	...	30.0	NaN	NaN	33.0	31.0	211.0	18.0	20.0	31.0	M

5 rows × 24 columns

Let’s also rename the branches and the status to English to allow more people to understand the analysis.

branches_map = {
  'أدبي': 'Humanities',
  'علمي علوم': 'Science',
  'علمي رياضة': 'Mathematics'
}

status_map = {
  'ناجح': 'Pass',
  'دور ثاني': 'Fail',
  'راسب': 'Fail'
}

df['Branch'] = df['Branch'].map(branches_map)
df['Status'] = df['Status'].map(status_map)

df.head()

	School Name	Administration	City	Branch	Percentage	Status	Arabic	First Foreign Language	Second Foreign Language	Pure Mathematics	...	Chemistry	Biology	Geology	Applied Math	Physics	Total	Religion	National Education	Economics Statistics	Gender
0	الاورمان الرسمية لغات بنين	الدقى	الجيزة	Humanities	87.80	Pass	61.0	27.0	34.0	NaN	...	NaN	NaN	NaN	NaN	NaN	360.0	17.0	25.0	37.0	M
1	جمال عبد الناصرالرسمية لغات بنات	الدقى	الجيزة	Science	57.32	Pass	47.0	25.0	26.0	NaN	...	30.0	30.0	44.0	NaN	33.0	235.0	18.0	20.0	27.0	F
2	هضبة الاهرام ث التجريبية لغات بنين	الهرم	الجيزة	Humanities	83.41	Pass	70.0	38.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	342.0	25.0	25.0	35.0	M
3	التحرير الرسمية لغات بنين	أكتوبر	الجيزة	Humanities	53.17	Pass	57.0	27.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	218.0	20.0	17.5	27.0	M
4	التحرير الرسمية لغات بنين	أكتوبر	الجيزة	Mathematics	51.46	Fail	56.0	25.0	6.0	30.0	...	30.0	NaN	NaN	33.0	31.0	211.0	18.0	20.0	31.0	M

5 rows × 24 columns

Now, let’s check for missing values.

df.isnull().sum().sort_values(ascending=False)

Pure Mathematics           585619
Applied Math               585543
History                    424193
Psychology                 424051
Philosophy                 424014
Geography                  423921
Biology                    359913
Geology                    359572
Physics                    264460
Chemistry                  263995
Economics Statistics        72219
Religion                    71800
National Education          71608
Second Foreign Language      5690
First Foreign Language       5109
Arabic                       3792
Gender                          3
Administration                  0
Status                          0
Total                           0
Percentage                      0
Branch                          0
City                            0
School Name                     0
dtype: int64

We find that there are missing values in the Gender column. Since we won’t be using this column in the analysis, we can ignore the missing values.

It would be ok if a student has missing values in the subjects columns given that the student is not taking the subject or he failed. So, let’s search for students contradicting this rule.

def has_missing_subject(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  for subject in student_branch_subjects:
    if pd.isnull(row[subject]):
      return True
  return False

df['Has Missing Subject'] = df.apply(has_missing_subject, axis=1)
len(df[df['Has Missing Subject'] & (df['Status'] == 'Pass')]) / len(df) * 100

0.19945834090522724

We find that such students exist; however, they are very few, only 0.2% of the students. This is probably due to a mistake in the data entry. We can safely remove these students.

To ease the analysis, let’s scale the grades of the subjects to be out of 100. This will allow us to compare the grades of the subjects.

def scale_subjects(df):
  for subject in subjects:
    df[subject] = df[subject] / subjects_df[subjects_df['subject'] == subject]['max_grade'].values[0] * 100
  return df

df = scale_subjects(df)
df.describe()

	Percentage	Arabic	First Foreign Language	Second Foreign Language	Pure Mathematics	History	Geography	Philosophy	Psychology	Chemistry	Biology	Geology	Applied Math	Physics	Total	Religion	National Education	Economics Statistics
count	682348.000000	678556.000000	677239.000000	676658.000000	96729.000000	258155.000000	258427.000000	258334.000000	258297.000000	418353.000000	322435.000000	322776.000000	96805.000000	417888.000000	682348.000000	610548.000000	610740.000000	610129.000000
mean	63.135943	62.578077	65.820089	72.626713	62.018112	54.623944	55.218434	63.507159	62.866113	64.482905	59.958705	72.446738	65.752131	59.311801	258.857377	19.023293	18.301424	29.652762
std	14.879646	14.414757	20.419700	20.723767	19.280541	16.187493	13.967272	13.460551	11.941338	19.698213	15.429559	17.856085	18.961954	20.245909	61.006543	2.863957	3.237647	5.241791
min	0.000000	1.250000	2.000000	2.500000	1.666667	1.666667	1.666667	1.666667	1.666667	3.333333	1.666667	1.666667	1.666667	1.666667	0.000000	1.000000	1.000000	2.000000
25%	52.930000	50.000000	50.000000	55.000000	50.000000	50.000000	50.000000	53.333333	55.000000	50.000000	50.000000	60.000000	50.000000	50.000000	217.000000	17.000000	16.000000	25.000000
50%	62.680000	61.250000	66.000000	75.000000	61.666667	50.000000	51.666667	65.000000	63.333333	63.333333	58.333333	76.666667	66.666667	55.000000	257.000000	19.000000	19.000000	29.000000
75%	73.660000	73.750000	82.000000	90.000000	76.666667	65.000000	63.333333	73.333333	71.666667	81.666667	71.666667	86.666667	81.666667	75.000000	302.000000	21.000000	21.000000	33.000000
max	99.270000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	407.000000	25.000000	25.000000	50.000000

Let’s also add a Min Grade, Max Grade, and Mean Grade columns for each student. This will allow us to ensure Status is correctly assigned.

def get_min_grade(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  return row[student_branch_subjects].min()

def get_max_grade(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  return row[student_branch_subjects].max()

def get_mean_grade(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  return row[student_branch_subjects].mean()

df['Min Grade'] = df.apply(get_min_grade, axis=1)
df['Max Grade'] = df.apply(get_max_grade, axis=1)
df['Mean Grade'] = df.apply(get_mean_grade, axis=1)
df.describe()

	Percentage	Arabic	First Foreign Language	Second Foreign Language	Pure Mathematics	History	Geography	Philosophy	Psychology	Chemistry	...	Geology	Applied Math	Physics	Total	Religion	National Education	Economics Statistics	Min Grade	Max Grade	Mean Grade
count	682348.000000	678556.000000	677239.000000	676658.000000	96729.000000	258155.000000	258427.000000	258334.000000	258297.000000	418353.000000	...	322776.000000	96805.000000	417888.000000	682348.000000	610548.000000	610740.000000	610129.000000	681293.000000	681293.000000	681293.000000
mean	63.135943	62.578077	65.820089	72.626713	62.018112	54.623944	55.218434	63.507159	62.866113	64.482905	...	72.446738	65.752131	59.311801	258.857377	19.023293	18.301424	29.652762	47.976382	80.370899	64.126046
std	14.879646	14.414757	20.419700	20.723767	19.280541	16.187493	13.967272	13.460551	11.941338	19.698213	...	17.856085	18.961954	20.245909	61.006543	2.863957	3.237647	5.241791	17.380803	14.787331	14.623857
min	0.000000	1.250000	2.000000	2.500000	1.666667	1.666667	1.666667	1.666667	1.666667	3.333333	...	1.666667	1.666667	1.666667	0.000000	1.000000	1.000000	2.000000	1.250000	1.250000	1.250000
25%	52.930000	50.000000	50.000000	55.000000	50.000000	50.000000	50.000000	53.333333	55.000000	50.000000	...	60.000000	50.000000	50.000000	217.000000	17.000000	16.000000	25.000000	32.000000	70.000000	53.464286
50%	62.680000	61.250000	66.000000	75.000000	61.666667	50.000000	51.666667	65.000000	63.333333	63.333333	...	76.666667	66.666667	55.000000	257.000000	19.000000	19.000000	29.000000	50.000000	82.500000	63.666667
75%	73.660000	73.750000	82.000000	90.000000	76.666667	65.000000	63.333333	73.333333	71.666667	81.666667	...	86.666667	81.666667	75.000000	302.000000	21.000000	21.000000	33.000000	58.333333	92.500000	74.821429
max	99.270000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	...	100.000000	100.000000	100.000000	407.000000	25.000000	25.000000	50.000000	100.000000	100.000000	100.000000

8 rows × 21 columns

df[(df['Min Grade'] < 50) & (df['Status'] == 'Pass')]

	School Name	Administration	City	Branch	Percentage	Status	Arabic	First Foreign Language	Second Foreign Language	Pure Mathematics	...	Physics	Total	Religion	National Education	Economics Statistics	Gender	Has Missing Subject	Min Grade	Max Grade	Mean Grade
158069	ليسية الهرم الخاصة لغات	العمرانية	الجيزة	Mathematics	40.98	Pass	52.50	52.0	50.0	31.666667	...	23.333333	168.0	16.0	14.0	33.0	M	False	23.333333	55.0	41.119048
158074	ليسية الهرم الخاصة لغات	العمرانية	الجيزة	Mathematics	42.20	Pass	50.00	64.0	50.0	21.666667	...	26.666667	173.0	20.0	14.0	25.0	M	False	21.666667	64.0	42.714286
158205	محرم الاسلامية الخاصة لغات	العمرانية	الجيزة	Science	38.54	Pass	30.00	50.0	57.5	NaN	...	50.000000	158.0	18.0	16.0	25.0	M	False	26.666667	57.5	40.119048
158224	محرم الاسلامية الخاصة لغات	العمرانية	الجيزة	Mathematics	39.76	Pass	20.00	50.0	50.0	50.000000	...	50.000000	163.0	16.0	14.0	25.0	M	False	20.000000	50.0	41.428571
158225	محرم الاسلامية الخاصة لغات	العمرانية	الجيزة	Mathematics	47.32	Pass	50.00	68.0	85.0	50.000000	...	30.000000	194.0	19.0	13.0	25.0	M	False	30.000000	85.0	49.476190
158253	الواحة الخاصة لغات	العمرانية	الجيزة	Science	47.32	Pass	51.25	60.0	77.5	NaN	...	26.666667	194.0	20.0	13.0	26.0	F	False	26.666667	77.5	48.869048
158262	ابن عطاء الله الخاصة لغات	العمرانية	الجيزة	Humanities	45.37	Pass	56.25	54.0	70.0	NaN	...	NaN	186.0	15.0	17.0	25.0	M	False	23.333333	70.0	46.226190
158271	ابن عطاء الله الخاصة لغات	العمرانية	الجيزة	Science	49.51	Pass	56.25	68.0	50.0	NaN	...	50.000000	203.0	21.0	19.0	29.0	F	False	18.333333	68.0	49.654762

8 rows × 28 columns

We find this unexpected behavior in the data. There are 8 students who have a total grade less than 50 but are marked as Passed. This is probably due to a mistake in the data entry. We will remove these students.

df = df[~((df['Min Grade'] < 50) & (df['Status'] == 'Pass'))]
df[(df['Min Grade'] < 50) & (df['Status'] == 'Pass')]

	School Name	Administration	City	Branch	Percentage	Status	Arabic	First Foreign Language	Second Foreign Language	Pure Mathematics	...	Physics	Total	Religion	National Education	Economics Statistics	Gender	Has Missing Subject	Min Grade	Max Grade	Mean Grade

0 rows × 28 columns

How are the grades distributed?

Most well-known standardized tests have a normal distribution. This means that most students score around the average, and the further you go from the average, the fewer students you find. Having a normal distribution is a good thing because it means that the test is fair. Some inistitutions might have a policy to adjust the grades to make them normally distributed. This is usually done by curving the grades.

One example of such test is the SAT. The SAT is a standardized test used for college admission in the United States and American Diploma students in other countries. The SAT is designed to have a normal distribution. The College Board, the organization that administers the SAT, doesn’t score the grades on a curve. This means your grade isn’t affected by how other students perform. Yet, the SAT remains standerdized and grades in one year are comparable to grades from another year. College Board succeded in making the score distribution normal by designing the test to be fair and accurate.

Let’s see if the Sanawya Amma grades are normally distributed.

plt.figure(figsize=(10, 6))
sns.histplot(df['Percentage'], bins=30, kde=True, color='blue')

# Add vertical lines for passing grade and median grade
plt.axvline(x=50, color='red', linestyle='--', label='Passing Grade')
plt.axvline(x=df['Percentage'].median(), color='green', linestyle='--', label=f'Median Grade: {df["Percentage"].median():.2f}')

# Add title and labels
plt.title('Students Grades Distribution')
plt.xlabel('Grade out of 100')
plt.ylabel('Number of Students')
plt.legend()
plt.tight_layout()
plt.show()

We see in the histogram that the grades are approximately normally distributed. This is a good sign that the test is fair. The grades however are not centered around 50% as we would expect. The median grade is ariund 63%. This means there are more students scoring above the average than passing grade rather than below it. This indicates that there are much more students passing the exam than failing it. We can see this in the pie chart below.

status_counts = df['Status'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(
    status_counts,
    labels=status_counts.index,
    autopct='%1.1f%%',
    startangle=140,
    colors=plt.cm.Paired.colors
)
plt.title('Students Status Distribution')
plt.show()

The pie char shows that the number of passing students is double the number of failing students.

Are all branches equal?

Let’s see how the students are distributed among the branches.

Note this graph is interactive. You can hover over the bars to see the exact number of students, and you can click on the legend to hide/show a branch.

colors = {'Humanities': '#3b82f6', 'Science': '#22c55e', 'Mathematics': '#ef4444'}

plt.figure(figsize=(12, 8))
sns.histplot(
    data=df,
    x='Percentage',
    hue='Branch',
    multiple='layer',
    stat='percent',
    palette=colors,
    bins=30
)

for branch, color in colors.items():
    branch_students = df[df['Branch'] == branch]
    median = branch_students['Percentage'].median()
    plt.axvline(x=median, color=color, linestyle='--', label=f'{branch} Median: {median:.2f}')

plt.title('Students Grade Distribution by Branch')
plt.xlabel('Grade out of 100')
plt.ylabel('Percentage of Students (%)')
plt.legend(title='Branch')
plt.tight_layout()
plt.show()

As we can see from the histogram, the grades for the Science and Mathematics branches are very simmilar. However, the Humanities branch has a lower much lower average grade. We can see in the histogram that the median grade for the Humanities branch is around 60%, while for the other branches it is around 65%. Moreover, the number of students scoring above 90% is much lower in the Humanities branch than in the other branches. This indicates one of two things: 1. The Humanities branch is harder than the other branches. 2. The students in the Humanities branch are less prepared than the other branches.

With the data we have, we can’t determine which of the two is true. However, we can see that the Humanities branch has a lower average grade.

How common is it to succeed in total yet fail in a subject?

Why do failing students fail?

A student is considered to have passed the exam if he scored more than 50% in every subject. Let’s see how common it is for a student to have a total grade above 50% yet fail in a subject.

df['Status'] = df['Status'].astype('category')

plt.figure(figsize=(12, 8))
sns.histplot(
    data=df,
    x='Percentage',
    hue='Status',
    multiple='layer',
    stat='count',
    palette='Set2',
    bins=30
)

plt.title('Students Grade Distribution by Status')
plt.xlabel('Grade out of 100')
plt.ylabel('Number of Students')

handles, labels = plt.gca().get_legend_handles_labels()
if not handles:
    plt.legend(title='Status', labels=df['Status'].cat.categories)
else:
    plt.legend(title='Status')

plt.tight_layout()
plt.show()

We make a very interesting observation. The grades distribution for failing student is normal and centered around 50%.

There is a common misconception that failing students are those who score very low in all subjects. However, this is not true. The grades of failing students are normally distributed around 50%. This means that failing students are those who score around the average in all subjects. This is a very interesting observation.

We can explore this further by seeing how common it is for a student to have a total grade above 50% yet fail in a subject.

failings = df[df['Status'] == 'Fail']
failings['Is Above 50'] = failings['Percentage'] >= 50

failings_counts = failings['Is Above 50'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(
    failings_counts,
    labels=['Above 50%', 'Below 50%'],
    autopct='%1.1f%%',
    startangle=140,
    colors=sns.color_palette('Set2', 2)
)

plt.title('Failings Distribution Above and Below 50%')
plt.show()

/var/folders/vp/fng1b7k14nq4khc8kp2lp6wh0000gn/T/ipykernel_49788/2852520307.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  failings['Is Above 50'] = failings['Percentage'] >= 50

That’s astonishing to see. Around 44% of the students who failed the exam had a total grade above 50%. This means that almost half of the students who failed the exam scored above the passing grade in every subject.

This is a very interesting observation. It means that failing students are not those who score very low in all subjects. Instead, they are those who score around the average in all subjects. This is a very important insight for students preparing for the exam. It means that they should focus on all subjects equally. They shouldn’t ignore a subject just because they don’t like it or because they think they are bad at it. They should focus on all subjects equally to pass the exam. If you are sooo good in one subject yet horrible in another, you are likely to fail.

failings['Min Mean Difference'] = failings['Mean Grade'] - failings['Min Grade']

plt.figure(figsize=(10, 6))
sns.histplot(
    data=failings,
    x='Min Mean Difference',
    bins=30,
    kde=False,
    color='blue'
)

mean_value = failings['Min Mean Difference'].mean()
plt.axvline(x=mean_value, color='red', linestyle='--', label=f'Mean: {mean_value:.2f}')

plt.title('Failings Students Average Grade - Min Grade Distribution')
plt.xlabel('Average Grade - Min Grade for failing students')
plt.ylabel('Number of Students')
plt.legend()

plt.tight_layout()
plt.show()

/var/folders/vp/fng1b7k14nq4khc8kp2lp6wh0000gn/T/ipykernel_49788/3472875337.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  failings['Min Mean Difference'] = failings['Mean Grade'] - failings['Min Grade']

This graph again proves the same point. The average difference between a failing students minimum grade and his average grade is 21.48%. This is a very large difference. Once again this proves that failing students are those who ignore a subject and focus on others. This is a very important advice for students preparing for the exam: Don’t ignore any subject.

Are the Compassion Grades a myth?

There is a common belief among students that the ministry gives compassion grades to students who are close to passing. This means that if a student is very close to passing, the ministry will give him a few extra points to pass.

When we ploted the grades distribution, we saw that the grades are normally distributed. This means that the grades are fair and accurate. If the ministry was giving compassion grades, the grades wouldn’t be normally distributed. Instead, we would see a spike in the grades just above the passing grade. This is not the case. The total grades are normally distributed.

However, the marks are added on the subjects level. Hence, the myth might be true although the total grades are normally distributed. Let’s start exploring this by checking the distribution of the for the Arabic subject.

plt.figure(figsize=(10, 6))
sns.histplot(
    data=df,
    x='Arabic',
    bins=30,
    kde=False,
    color='blue'
)

plt.title('Arabic Grades Distribution')
plt.xlabel('Arabic Grades')
plt.ylabel('Number of Students')

plt.tight_layout()
plt.show()

WOW!

The myth seems true. The grades for the Arabic subject are not normally distributed. Instead, there is a HUGE spike in the grades just above the passing grade, and then the grades drop to 0. This is a clear indication that the ministry is giving compassion grades, but before making a conclusion, let’s quantify this phenomenon and check other subjects.

For each subject, we will calculate the percentage of students who failed within 2.5% of the passing grade and the percentage of students who passed within 2.5% of the passing grade. We will then compare these percentages to see if the ministry is giving compassion grades.

If the ministry is giving compassion grades, we would expect to see a much higher percentage of students passing within 2.5% of the passing grade than the percentage of students failing within 2.5% of the passing grade.

def get_barely_passing_students_percentage(df, subject, margin=2.5, threshold=50):
    return len(df[(df[subject] >= threshold) & (df[subject] <= threshold + margin)]) / len(df[df[subject] >= threshold]) * 100

def get_barely_failing_students_percentage(df, subject, margin=2.5, threshold=50):
    return len(df[(df[subject] < threshold) & (df[subject] >= threshold - margin)]) / len(df[df[subject] < threshold]) * 100

barely_passed_perc = [get_barely_passing_students_percentage(df, subject) for subject in subjects]
barely_failed_perc = [get_barely_failing_students_percentage(df, subject) for subject in subjects]

barely_passed_df = pd.DataFrame({
    'Subject': subjects,
    'Barely Passing Percentage': barely_passed_perc,
    'Barely Failing Percentage': barely_failed_perc
})

barely_passed_df_melted = barely_passed_df.melt(id_vars='Subject', 
                                                var_name='Status', 
                                                value_name='Percentage')

plt.figure(figsize=(12, 6))
sns.barplot(
    data=barely_passed_df_melted,
    x='Subject',
    y='Percentage',
    hue='Status',
    palette='Set2'
)

plt.title('Barely Passing and Failing Students Percentage by Subject')
plt.xlabel('Subject')
plt.ylabel('Percentage of Students (%)')
plt.xticks(rotation=45)
plt.legend(title='Status')

plt.tight_layout()
plt.show()

The difference is so big that in the above bar chart we can’t even see the percentage of students who failed within 2.5% of the passing grade. So, let’s print the exact percentages to see the difference.

barely_passed_df

	Subject	Barely Passing Percentage	Barely Failing Percentage
0	Arabic	24.907257	0.000000
1	First Foreign Language	20.700964	0.004478
2	Second Foreign Language	18.067144	0.004093
3	Pure Mathematics	28.560174	0.008562
4	History	46.203238	0.000000
5	Geography	43.848881	0.000000
6	Philosophy	15.973283	0.000000
7	Psychology	16.007772	0.000000
8	Chemistry	24.637011	0.000000
9	Biology	34.855668	0.000000
10	Geology	9.788448	0.000000
11	Applied Math	24.087328	0.000000
12	Physics	38.149330	0.000000

The difference is incredible. For 9 of the subjects, no student failed within 2.5% of the passing grade. This is a clear indication that the ministry is giving compassion grades. The myth is true.

Does city affect the grades?

An interesting question to ask is whether the city affects the grades. Are students from some cities more likely to score higher than students from other cities?

To make this analysis readable to a wider audience, I will begin by translating the city names to English.

city_english_names_map = {
  'القاهرة': 'Cairo',
  'الإسكندرية': 'Alexandria',
  'الجيزة': 'Giza',
  'الدقهلية': 'Dakahlia',
  'البحيرة': 'Beheira',
  'المنوفية': 'Monufia',
  'الشرقية': 'Sharqia',
  'الغربية': 'Gharbia',
  'الفيوم': 'Fayoum',
  'القليوبية': 'Qalyubia',
  'المنيا': 'Minya',
  'الاقصر': 'Luxor',
  'البحر الأحمر': 'Red Sea',
  'الوادي الجديد': 'New Valley',
  'السويس': 'Suez',
  'الاسكندرية': 'Alexandria',
  'الإسماعيلية': 'Ismailia',
  'اسوان': 'Aswan',
  'اسيوط': 'Assiut',
  'بني سويف': 'Beni Suef',
  'بورسعيد': 'Port Said',
  'جنوب سيناء': 'South Sinai',
  'دمياط': 'Damietta',
  'سوهاج': 'Sohag',
  'قنا': 'Qena',
  'كفر الشيخ': 'Kafr El Sheikh',
  'مطروح': 'Matrouh',
  'شمال سيناء': 'North Sinai'
}

df['City'] = df['City'].map(city_english_names_map)
df['City'].unique()

array(['Giza', 'Beni Suef', 'Fayoum', 'Cairo', 'Ismailia', 'Port Said',
       'Monufia', 'Gharbia', 'Alexandria', 'Sharqia', 'Assiut',
       'Damietta', 'Sohag', 'Dakahlia', 'Suez', 'Qalyubia', 'Beheira',
       'Kafr El Sheikh', 'Qena', 'Aswan', 'Matrouh', 'Minya',
       'North Sinai', 'Red Sea', 'South Sinai', 'Luxor', 'New Valley'],
      dtype=object)

Now let’s see how the grades are distributed among the cities.

cities = df['City'].unique()
student_count = [len(df[df['City'] == city]) for city in cities]
average_grade = [df[df['City'] == city]['Percentage'].mean() for city in cities]

cities_df = pd.DataFrame({'City': cities, 'Student Count': student_count, 'Average Grade': average_grade})
cities_df['Average Grade'] = cities_df['Average Grade'].round(1)

fig = px.treemap(
  cities_df,
  path=['City'],
  values='Average Grade',
  color='Average Grade',
  color_continuous_scale='Blues',
  title='Grades Distribution over Cities'
)

fig.show()

Unable to display output for mime type(s): application/vnd.plotly.v1+json

From the above tree map, we see that some cities have a much higher average grade than others. For example, students from North Sinai have average grades 74.6% while students from Minya have average grades 55.5%. This is a very large difference.

Now, I would like to know if the city’s average grade is related to the number of students from the city and the city’s geographical location.

fig = px.treemap(
  cities_df,
  path=['City'],
  values='Student Count',
  color='Average Grade',
  color_continuous_scale='Blues',
  title='Grades Distribution over Cities by Student Count'
)

fig.show()

Unable to display output for mime type(s): application/vnd.plotly.v1+json

We notice a weak negative correlation between the city’s average grade and the number of students from the city. This means that cities with fewer students have higher average grades; however, the correlation isn’t very strong.

city_coordinates_map = {
  'Cairo': (30.033333, 31.233334),
  'Alexandria': (31.200092, 29.918739),
  'Giza': (30.013056, 31.208853),
  'Dakahlia': (31.034431, 31.380691),
  'Beheira': (30.469561, 30.931739),
  'Monufia': (30.464713, 31.18422),
  'Sharqia': (30.587676, 31.501218),
  'Gharbia': (30.793408, 31.012645),
  'Fayoum': (29.3084, 30.8416),
  'Qalyubia': (30.328454, 31.243225),
  'Minya': (28.1187, 30.7416),
  'Luxor': (25.6872, 32.6396),
  'Red Sea': (26.7153, 33.9368),
  'New Valley': (25.6773, 28.8955),
  'Suez': (30.0075, 32.5498),
  'Ismailia': (30.5903, 32.2653),
  'Aswan': (24.088938, 32.899829),
  'Assiut': (27.1828, 31.0014),
  'Beni Suef': (29.0734, 31.0994),
  'Port Said': (31.2565, 32.2841),
  'South Sinai': (28.4177, 33.0451),
  'Damietta': (31.4165, 31.8133),
  'Sohag': (26.5569, 31.6948),
  'Qena': (26.1553, 32.7163),
  'Kafr El Sheikh': (31.1093, 30.9367),
  'Matrouh': (31.3549, 27.2373),
  'North Sinai': (31.2156, 33.3581)
}

cities_df['Latitude'] = cities_df['City'].map(lambda city: city_coordinates_map[city][0])
cities_df['Longitude'] = cities_df['City'].map(lambda city: city_coordinates_map[city][1])

fig = px.scatter_mapbox(
  cities_df,
  lat='Latitude',
  lon='Longitude',
  color='Average Grade',
  size='Student Count',
  hover_name='City',
  zoom=4,
  title='Grades Distribution over cities on Egypt Map',
  labels={'Average Percentage': 'Average Percentage'},
  color_continuous_scale='tealrose'
)

fig.update_layout(mapbox_style='open-street-map')

fig.show()

/var/folders/vp/fng1b7k14nq4khc8kp2lp6wh0000gn/T/ipykernel_49788/931561192.py:34: DeprecationWarning:

*scatter_mapbox* is deprecated! Use *scatter_map* instead. Learn more at: https://plotly.com/python/mapbox-to-maplibre/

Unable to display output for mime type(s): application/vnd.plotly.v1+json

From the map, it seems that northern cities tend to have higher average grades than southern cities.

Do all subjects have the same difficulty?

It is no secret that a student tend to find some subjects easier than others, but what’s important are subjects that are precieved as difficult by most students. To figure this out we will plot a bbar chart for the mean grade of each subject.

subject_avg_grades = [df[subject].mean() for subject in subjects]
subject_avg_grades_df = pd.DataFrame({'Subject': subjects, 'Average Grade': subject_avg_grades})
subject_avg_grades_df = subject_avg_grades_df.sort_values('Average Grade', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(
    data=subject_avg_grades_df,
    x='Subject',
    y='Average Grade',
    palette='Blues_d'
)

plt.title('Average Grades by Subject')
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

/var/folders/vp/fng1b7k14nq4khc8kp2lp6wh0000gn/T/ipykernel_49788/2442988210.py:6: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

From this graph, we can see that the subjects are not equally difficult. The subjects that are considered the most difficult are: - History - Geography

The subjects that are considered the easiest are: - Second Foreign Language (French or German) - Geology

Conclusion

The Sanawya Amma exam is a major turning point in the life of any Egyptian student. In this notebook, we uncovered some interesting insights about the exam. I will summarize the most important insights here: - The grades are normally distributed, which is a good sign that the exam is fair. - The Humanities branch has a lower average grade than the other branches. - Failing students tend to score around the average in all subjects. This means students shouldn’t ignore any subject. - The ministry is giving compassion grades. The myth is true. - The city affects the grades. Northern cities tend to have higher average grades than southern cities.

If you have finished the exam, I hope you found this analysis insightful. If you are a non-Egyptian, I hope this notebook gave you a glimpse into the Egyptian education system. If you are a student preparing for the exam, I hope you found these insights useful. Remember, the exam is fair, and you should focus on all subjects equally. Don’t ignore any subject. Good luck!

Future Work

Starting from the year 2024-2025, the Egyptian education system started a new system for the Sanawya Amma exam. The new system has different subjects, different structure, and different education style. I am happy that the Egyptian education system is taking these steps to improve itself and mantain the high educational standards in Egypt. I am optimistic that the upcoming change will only lead the nation forward towards a brighter future and better education for all.

Based on my personal knowledge of educational systems, I suggest that the new system add multiple trials for the students. This will relieve the students from the enormous presseure and stress imposed on them by the exam. Multiple trials give students breathing space and allow them to perform better and show their true potential. It also cancels the unfairness caussed by a bad day and gives students a second chance to prove themselves and show their true skill and knowledge. Other international tests like the SAT and the IGCSE implement the multiple trials system and it has proven to be very successful and beneficial for the students.

I hope to analyze the new system in the future and compare it to the old system. This will allow us to see the impact of the new system on the students’ grades and the education system in Egypt.