Exploring Egypt’s College Admission Test

2022 Sanawya Amma Results Analysis

education
analysis
data visualization
Author

Ibrahim Habib

Published

September 3, 2024

Every year, high school seniors face one of the biggest challenges of their lives: college admission. Each country has its own admission system, which shapes the career paths of its students. Most countries use some form of college admission test. Today, I will explore with you Egypt’s college admission test: Sanawya Amma (ثانوية عامة).

Taking the exam is a rite of passage for any Egyptian student. Students know that their whole future depends on the exam results. Hence, preparation for it starts months before the exams, and when the exams start the country seems to stop in unity, wishing success for these students.

It is no surprise that the students give the assessment this feeling of importance. Students enter the faculty and the university of their choice based on the exam’s result. After the exam, the ministry releases the minimum grade for each faculty. If a student scores below the grade assigned to his faculty of choice, he won’t be able to fulfill his dreams. Thus, the test is given this huge importance.

Given the importance of this test, we are going to explore its results. We aim to understand what factors into exam success and to analyze the distribution of the marks. By the end, we will hopefully find useful insights for students. So, let’s get started!

About the dataset

In this analysis, we will use the High School (ثانوية عامة) Public Results 2022 EG from Kaggle found here. This dataset contains the exam results for the year 2022. The data was collected from public websites that released the test results.

Let’s start by viewing some instances in the dataset and the available features.

import pandas as pd

df = pd.read_csv('High_School_Public_Results_2022_EG_first_attempt.csv')
df.drop('desk_no', axis=1, inplace=True)
df.head()
school_name administration city branch Percentage status arabic first_foreign_lang second_foreign_lang pure_mathematics ... chemistry biology geology applied_math physics total religion national_education economics_statistics gender
0 الاورمان الرسمية لغات بنين الدقى الجيزة أدبي 87.80 ناجح 61.0 27.0 34.0 NaN ... NaN NaN NaN NaN NaN 360.0 17.0 25.0 37.0 M
1 جمال عبد الناصرالرسمية لغات بنات الدقى الجيزة علمي علوم 57.32 ناجح 47.0 25.0 26.0 NaN ... 30.0 30.0 44.0 NaN 33.0 235.0 18.0 20.0 27.0 F
2 هضبة الاهرام ث التجريبية لغات بنين الهرم الجيزة أدبي 83.41 ناجح 70.0 38.0 NaN NaN ... NaN NaN NaN NaN NaN 342.0 25.0 25.0 35.0 M
3 التحرير الرسمية لغات بنين أكتوبر الجيزة أدبي 53.17 ناجح 57.0 27.0 NaN NaN ... NaN NaN NaN NaN NaN 218.0 20.0 17.5 27.0 M
4 التحرير الرسمية لغات بنين أكتوبر الجيزة علمي رياضة 51.46 دور ثاني 56.0 25.0 6.0 30.0 ... 30.0 NaN NaN 33.0 31.0 211.0 18.0 20.0 31.0 M

5 rows × 24 columns

df.columns
Index(['school_name', 'administration', 'city', 'branch', 'Percentage',
       'status', 'arabic', 'first_foreign_lang', 'second_foreign_lang',
       'pure_mathematics', 'history', 'geography', 'philosophy', 'psychology',
       'chemistry', 'biology', 'geology', 'applied_math', 'physics', 'total',
       'religion', 'national_education', 'economics_statistics', 'gender'],
      dtype='object')

The data set contains the following columns:
- school_name: The name of the student’s school.
- administration: The administration of the student’s school. An administration is a group of schools that are managed by the same entity. They are divided based on the geographical location of the schools.
- city: The city where the student is from.
- branch: The branch of Sanawya Amma the student is in. There are three branches: Humanities (ادبي), Science (علمي علوم), and Mathematics (علمي رياضة).
- Percentage: The student’s scaled grade in the exam out of 100. This is the most important feature as it determines the student’s future.
- status: The student’s status in the exam. There are three statuses: Passed, Second Chance, and Failed. More on that in the next section.
- [subject] columns: The student’s grade in the subject. We will see the subjects in the next section.
- total: The student’s total grade in the exam. This is the sum of the student’s grades in all subjects. It is out of 410. The percentage is calculated by dividing the total by 410 and multiplying by 100.
- gender: The student’s gender. Either M or F.
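As a quick sanity check of the percentage formula described above, the following sketch recomputes Percentage from Total for the five rows in the preview. The sample values are copied from the `df.head()` output; the helper name `to_percentage` and the constant `MAX_TOTAL` are introduced here purely for illustration.

```python
MAX_TOTAL = 410  # maximum total grade, as stated in the column description

def to_percentage(total, max_total=MAX_TOTAL):
    """Scale a raw total to a grade out of 100, rounded to two decimals."""
    return round(total / max_total * 100, 2)

# (Total, Percentage) pairs copied from the df.head() preview
sample_rows = [(360.0, 87.80), (235.0, 57.32), (342.0, 83.41),
               (218.0, 53.17), (211.0, 51.46)]

for total, reported in sample_rows:
    print(f"total={total:.0f} computed={to_percentage(total):.2f} reported={reported:.2f}")
```

For these rows, the recomputed values agree with the reported Percentage column.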

Sanawya Amma Structure

Branches

Sanawya Amma has three branches: Humanities (ادبي), Science (علمي علوم), and Mathematics (علمي رياضة). Each branch has its own set of subjects. The subjects are divided into two categories: Core and Elective. The core subjects are the same for all branches, while the elective subjects are different for each branch.

Let’s understand how the students are distributed among the branches.

Subjects

The core subjects are divided into two categories: those that affect the total grade and those that don’t. Students are only required to pass the subjects that affect the total grade. To ease the analysis, we will use core subjects to refer to the subjects that affect the total grade, elective subjects to refer to the electives that affect the total grade, and pass-fail subjects to refer to the subjects that don’t affect the total grade. Note that this is not the official terminology.

The core subjects are:
- Arabic: The Arabic language scored out of 80.
- First Foreign Language: The first foreign language, English, scored out of 50.
- Second Foreign Language: The second foreign language, French or German, scored out of 40.

The elective subjects for each branch are:
- Humanities:
  - History: Scored out of 60.
  - Geography: Scored out of 60.
  - Philosophy: Scored out of 60.
  - Psychology: Scored out of 60.
- Science:
  - Biology: Scored out of 60.
  - Chemistry: Scored out of 60.
  - Physics: Scored out of 60.
  - Geology: Scored out of 60.
- Mathematics:
  - Pure Mathematics: Scored out of 60.
  - Applied Mathematics: Scored out of 60.
  - Physics: Scored out of 60.
  - Chemistry: Scored out of 60.

The pass-fail subjects are:
- Religion
- National Education
- Economics and Statistics

Students usually don’t prepare as well for the pass-fail subjects as they do for the core and elective subjects. Therefore, we will focus on the core and elective subjects in our analysis.

Note: This represents the 2022 Sanawya Amma exam and isn’t necessarily the same for other years.

Since the subjects are scored differently, let’s visualize how each subject affects the total grade.

import plotly.express as px

subjects = ['Arabic', 'First Foreign Language',
'Second Foreign Language', 'Pure Mathematics', 'History', 'Geography',
'Philosophy', 'Psychology', 'Chemistry', 'Biology', 'Geology',
'Applied Math', 'Physics']
max_grades = [80, 50, 40, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60]

branch_subjects_map = {
    'Humanities': ['Arabic', 'First Foreign Language', 'Second Foreign Language',
                   'History', 'Geography', 'Philosophy', 'Psychology'],
    
    'Science': ['Arabic', 'First Foreign Language', 'Second Foreign Language', 
                'Chemistry', 'Biology', 'Geology', 'Physics'],
    
    'Mathematics': ['Arabic', 'First Foreign Language', 'Second Foreign Language', 
                    'Pure Mathematics', 'Applied Math', 'Physics', 'Chemistry']
}

subjects_df = pd.DataFrame({'subject': subjects, 'max_grade': max_grades})


def plot_subjects_grade_contribution_for_branch(branch):
  branch_subjects = branch_subjects_map[branch]
  branch_subjects_df = subjects_df[subjects_df['subject'].isin(branch_subjects)].sort_values('max_grade', ascending=False)
  fig = px.pie(branch_subjects_df, values='max_grade', names='subject', 
               title=f'{branch} Branch Grades Distribution', color_discrete_sequence=px.colors.sequential.Blues_r,
                labels={'subject': 'Subject', 'max_grade': 'Max Grade'}, hole=0.3)
  fig.show()
  
plot_subjects_grade_contribution_for_branch('Humanities')
plot_subjects_grade_contribution_for_branch('Science')
plot_subjects_grade_contribution_for_branch('Mathematics')
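The contribution of each subject can also be tabulated numerically. The sketch below does this for the Humanities branch, using the maximum grades listed in the Subjects section; the variable names `max_grade` and `shares` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Maximum grades for the Humanities branch, as listed in the Subjects section
max_grade = {
    'Arabic': 80, 'First Foreign Language': 50, 'Second Foreign Language': 40,
    'History': 60, 'Geography': 60, 'Philosophy': 60, 'Psychology': 60,
}

shares = pd.Series(max_grade, name='max_grade').to_frame()
# Each subject's share of the branch total, in percent
shares['share_of_total_%'] = (shares['max_grade'] / shares['max_grade'].sum() * 100).round(1)

print(shares)
print('Branch total:', shares['max_grade'].sum())
```

Because every elective is scored out of 60, the other two branches have the same total of 410 and the same split between core subjects and electives.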

Status

The students can have one of three statuses:
- Passed: The student passed the exam, scoring at least 50% in every subject.
- Second Chance: The student failed at most three subjects and is eligible for a second chance to pass them.
- Failed: The student failed more than three subjects and has to retake all the subjects next year.

During the analysis, we won’t consider the retake for Second Chance students. We will only consider the first attempt; thus, we will treat the Second Chance students as Failed students.
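To make the three statuses concrete, here is a minimal sketch of the rule as described above. The function `classify_status` and its parameters are hypothetical; the dataset’s status column comes from the published results, not from this function.

```python
def classify_status(scaled_grades, pass_mark=50, max_retakes=3):
    """Return the status implied by a student's subject grades (each out of 100)."""
    # Count the subjects in which the student scored below the passing mark
    failed = sum(1 for grade in scaled_grades if grade < pass_mark)
    if failed == 0:
        return 'Passed'
    if failed <= max_retakes:
        return 'Second Chance'
    return 'Failed'

# Hypothetical grade lists, not taken from the dataset
print(classify_status([76, 54, 85, 60, 62, 70, 66]))  # no subject below 50
print(classify_status([70, 50, 15, 50, 50, 55, 52]))  # one subject below 50
print(classify_status([40, 30, 45, 20, 60, 55, 52]))  # four subjects below 50
```

In this analysis, both Second Chance and Failed are later collapsed into a single Fail label.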

Cleaning the dataset

Let’s start by renaming the columns to make them more readable.

df.columns
Index(['school_name', 'administration', 'city', 'branch', 'Percentage',
       'status', 'arabic', 'first_foreign_lang', 'second_foreign_lang',
       'pure_mathematics', 'history', 'geography', 'philosophy', 'psychology',
       'chemistry', 'biology', 'geology', 'applied_math', 'physics', 'total',
       'religion', 'national_education', 'economics_statistics', 'gender'],
      dtype='object')
columns_rename_map = {
  'school_name': 'School Name',
  'administration': 'Administration',
  'city': 'City',
  'branch': 'Branch',
  'status': 'Status',
  'arabic': 'Arabic',
  'first_foreign_lang': 'First Foreign Language',
  'second_foreign_lang': 'Second Foreign Language',
  'pure_mathematics': 'Pure Mathematics',
  'history': 'History',
  'geography': 'Geography',
  'philosophy': 'Philosophy',
  'psychology': 'Psychology',
  'chemistry': 'Chemistry',
  'biology': 'Biology',
  'geology': 'Geology',
  'applied_math': 'Applied Math',
  'physics': 'Physics',
  'total': 'Total',
  'religion': 'Religion',
  'national_education': 'National Education',
  'economics_statistics': 'Economics Statistics',
  'gender': 'Gender',
}

df = df.rename(columns=columns_rename_map)
df.head()
School Name Administration City Branch Percentage Status Arabic First Foreign Language Second Foreign Language Pure Mathematics ... Chemistry Biology Geology Applied Math Physics Total Religion National Education Economics Statistics Gender
0 الاورمان الرسمية لغات بنين الدقى الجيزة أدبي 87.80 ناجح 61.0 27.0 34.0 NaN ... NaN NaN NaN NaN NaN 360.0 17.0 25.0 37.0 M
1 جمال عبد الناصرالرسمية لغات بنات الدقى الجيزة علمي علوم 57.32 ناجح 47.0 25.0 26.0 NaN ... 30.0 30.0 44.0 NaN 33.0 235.0 18.0 20.0 27.0 F
2 هضبة الاهرام ث التجريبية لغات بنين الهرم الجيزة أدبي 83.41 ناجح 70.0 38.0 NaN NaN ... NaN NaN NaN NaN NaN 342.0 25.0 25.0 35.0 M
3 التحرير الرسمية لغات بنين أكتوبر الجيزة أدبي 53.17 ناجح 57.0 27.0 NaN NaN ... NaN NaN NaN NaN NaN 218.0 20.0 17.5 27.0 M
4 التحرير الرسمية لغات بنين أكتوبر الجيزة علمي رياضة 51.46 دور ثاني 56.0 25.0 6.0 30.0 ... 30.0 NaN NaN 33.0 31.0 211.0 18.0 20.0 31.0 M

5 rows × 24 columns

Let’s also translate the branch and status values to English so that more people can follow the analysis.

branches_map = {
  'أدبي': 'Humanities',
  'علمي علوم': 'Science',
  'علمي رياضة': 'Mathematics'
}

status_map = {
  'ناجح': 'Pass',
  'دور ثاني': 'Fail',
  'راسب': 'Fail'
}

df['Branch'] = df['Branch'].map(branches_map)
df['Status'] = df['Status'].map(status_map)

df.head()
School Name Administration City Branch Percentage Status Arabic First Foreign Language Second Foreign Language Pure Mathematics ... Chemistry Biology Geology Applied Math Physics Total Religion National Education Economics Statistics Gender
0 الاورمان الرسمية لغات بنين الدقى الجيزة Humanities 87.80 Pass 61.0 27.0 34.0 NaN ... NaN NaN NaN NaN NaN 360.0 17.0 25.0 37.0 M
1 جمال عبد الناصرالرسمية لغات بنات الدقى الجيزة Science 57.32 Pass 47.0 25.0 26.0 NaN ... 30.0 30.0 44.0 NaN 33.0 235.0 18.0 20.0 27.0 F
2 هضبة الاهرام ث التجريبية لغات بنين الهرم الجيزة Humanities 83.41 Pass 70.0 38.0 NaN NaN ... NaN NaN NaN NaN NaN 342.0 25.0 25.0 35.0 M
3 التحرير الرسمية لغات بنين أكتوبر الجيزة Humanities 53.17 Pass 57.0 27.0 NaN NaN ... NaN NaN NaN NaN NaN 218.0 20.0 17.5 27.0 M
4 التحرير الرسمية لغات بنين أكتوبر الجيزة Mathematics 51.46 Fail 56.0 25.0 6.0 30.0 ... 30.0 NaN NaN 33.0 31.0 211.0 18.0 20.0 31.0 M

5 rows × 24 columns

Now, let’s check for missing values.

df.isnull().sum().sort_values(ascending=False)
Pure Mathematics           585619
Applied Math               585543
History                    424193
Psychology                 424051
Philosophy                 424014
Geography                  423921
Biology                    359913
Geology                    359572
Physics                    264460
Chemistry                  263995
Economics Statistics        72219
Religion                    71800
National Education          71608
Second Foreign Language      5690
First Foreign Language       5109
Arabic                       3792
Gender                          3
Administration                  0
Status                          0
Total                           0
Percentage                      0
Branch                          0
City                            0
School Name                     0
dtype: int64

We find that there are missing values in the Gender column. Since we won’t be using this column in the analysis, we can ignore the missing values.

A missing value in a subject column is fine if the student’s branch doesn’t include that subject or if the student failed. So, let’s search for students who contradict this rule.

def has_missing_subject(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  for subject in student_branch_subjects:
    if pd.isnull(row[subject]):
      return True
  return False

df['Has Missing Subject'] = df.apply(has_missing_subject, axis=1)
len(df[df['Has Missing Subject'] & (df['Status'] == 'Pass')]) / len(df) * 100
0.19945834090522724

We find that such students exist; however, they are very few, only 0.2% of the students. This is probably due to a mistake in the data entry. We can safely remove these students.
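The removal step itself is not shown here. A minimal sketch of how it could be done, assuming the `Has Missing Subject` and `Status` columns built above, is given below on a small toy frame (the rows are made up for illustration).

```python
import pandas as pd

# Toy frame standing in for the real data; column names match the ones used above
toy = pd.DataFrame({
    'Status': ['Pass', 'Pass', 'Fail', 'Fail'],
    'Has Missing Subject': [False, True, True, False],
})

# Flag passing students who are missing a grade in one of their branch subjects
inconsistent = toy['Has Missing Subject'] & (toy['Status'] == 'Pass')

# Keep everyone else
cleaned = toy[~inconsistent].reset_index(drop=True)
print(f"Removed {inconsistent.sum()} of {len(toy)} rows")
```

Failing students with missing grades are kept, since a missing grade is acceptable for them under the rule stated earlier.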

To ease the analysis, let’s scale the grades of the subjects to be out of 100. This will allow us to compare the grades of the subjects.

def scale_subjects(df):
  for subject in subjects:
    df[subject] = df[subject] / subjects_df[subjects_df['subject'] == subject]['max_grade'].values[0] * 100
  return df

df = scale_subjects(df)
df.describe()
Percentage Arabic First Foreign Language Second Foreign Language Pure Mathematics History Geography Philosophy Psychology Chemistry Biology Geology Applied Math Physics Total Religion National Education Economics Statistics
count 682348.000000 678556.000000 677239.000000 676658.000000 96729.000000 258155.000000 258427.000000 258334.000000 258297.000000 418353.000000 322435.000000 322776.000000 96805.000000 417888.000000 682348.000000 610548.000000 610740.000000 610129.000000
mean 63.135943 62.578077 65.820089 72.626713 62.018112 54.623944 55.218434 63.507159 62.866113 64.482905 59.958705 72.446738 65.752131 59.311801 258.857377 19.023293 18.301424 29.652762
std 14.879646 14.414757 20.419700 20.723767 19.280541 16.187493 13.967272 13.460551 11.941338 19.698213 15.429559 17.856085 18.961954 20.245909 61.006543 2.863957 3.237647 5.241791
min 0.000000 1.250000 2.000000 2.500000 1.666667 1.666667 1.666667 1.666667 1.666667 3.333333 1.666667 1.666667 1.666667 1.666667 0.000000 1.000000 1.000000 2.000000
25% 52.930000 50.000000 50.000000 55.000000 50.000000 50.000000 50.000000 53.333333 55.000000 50.000000 50.000000 60.000000 50.000000 50.000000 217.000000 17.000000 16.000000 25.000000
50% 62.680000 61.250000 66.000000 75.000000 61.666667 50.000000 51.666667 65.000000 63.333333 63.333333 58.333333 76.666667 66.666667 55.000000 257.000000 19.000000 19.000000 29.000000
75% 73.660000 73.750000 82.000000 90.000000 76.666667 65.000000 63.333333 73.333333 71.666667 81.666667 71.666667 86.666667 81.666667 75.000000 302.000000 21.000000 21.000000 33.000000
max 99.270000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 407.000000 25.000000 25.000000 50.000000
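For reference, the same scaling can be written without a loop. The sketch below is an alternative formulation, not the code used to produce the table above; it uses the raw core-subject grades of the first student in the preview and relies on `DataFrame.div` aligning a Series with the column labels.

```python
import pandas as pd

# Maximum grades of the three core subjects
max_grade = pd.Series({'Arabic': 80, 'First Foreign Language': 50,
                       'Second Foreign Language': 40})

# Raw grades of the first student shown in the df.head() preview
raw = pd.DataFrame({'Arabic': [61.0], 'First Foreign Language': [27.0],
                    'Second Foreign Language': [34.0]})

# Divide each column by its own maximum, then convert to a grade out of 100
scaled = raw.div(max_grade, axis='columns') * 100
print(scaled.round(2))
```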

Let’s also add Min Grade, Max Grade, and Mean Grade columns for each student. This will allow us to check that Status is correctly assigned.

def get_min_grade(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  return row[student_branch_subjects].min()

def get_max_grade(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  return row[student_branch_subjects].max()

def get_mean_grade(row):
  student_branch_subjects = branch_subjects_map[row['Branch']]
  return row[student_branch_subjects].mean()

df['Min Grade'] = df.apply(get_min_grade, axis=1)
df['Max Grade'] = df.apply(get_max_grade, axis=1)
df['Mean Grade'] = df.apply(get_mean_grade, axis=1)
df.describe()  
Percentage Arabic First Foreign Language Second Foreign Language Pure Mathematics History Geography Philosophy Psychology Chemistry ... Geology Applied Math Physics Total Religion National Education Economics Statistics Min Grade Max Grade Mean Grade
count 682348.000000 678556.000000 677239.000000 676658.000000 96729.000000 258155.000000 258427.000000 258334.000000 258297.000000 418353.000000 ... 322776.000000 96805.000000 417888.000000 682348.000000 610548.000000 610740.000000 610129.000000 681293.000000 681293.000000 681293.000000
mean 63.135943 62.578077 65.820089 72.626713 62.018112 54.623944 55.218434 63.507159 62.866113 64.482905 ... 72.446738 65.752131 59.311801 258.857377 19.023293 18.301424 29.652762 47.976382 80.370899 64.126046
std 14.879646 14.414757 20.419700 20.723767 19.280541 16.187493 13.967272 13.460551 11.941338 19.698213 ... 17.856085 18.961954 20.245909 61.006543 2.863957 3.237647 5.241791 17.380803 14.787331 14.623857
min 0.000000 1.250000 2.000000 2.500000 1.666667 1.666667 1.666667 1.666667 1.666667 3.333333 ... 1.666667 1.666667 1.666667 0.000000 1.000000 1.000000 2.000000 1.250000 1.250000 1.250000
25% 52.930000 50.000000 50.000000 55.000000 50.000000 50.000000 50.000000 53.333333 55.000000 50.000000 ... 60.000000 50.000000 50.000000 217.000000 17.000000 16.000000 25.000000 32.000000 70.000000 53.464286
50% 62.680000 61.250000 66.000000 75.000000 61.666667 50.000000 51.666667 65.000000 63.333333 63.333333 ... 76.666667 66.666667 55.000000 257.000000 19.000000 19.000000 29.000000 50.000000 82.500000 63.666667
75% 73.660000 73.750000 82.000000 90.000000 76.666667 65.000000 63.333333 73.333333 71.666667 81.666667 ... 86.666667 81.666667 75.000000 302.000000 21.000000 21.000000 33.000000 58.333333 92.500000 74.821429
max 99.270000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 ... 100.000000 100.000000 100.000000 407.000000 25.000000 25.000000 50.000000 100.000000 100.000000 100.000000

8 rows × 21 columns

df[(df['Min Grade'] < 50) & (df['Status'] == 'Pass')]
School Name Administration City Branch Percentage Status Arabic First Foreign Language Second Foreign Language Pure Mathematics ... Physics Total Religion National Education Economics Statistics Gender Has Missing Subject Min Grade Max Grade Mean Grade
158069 ليسية الهرم الخاصة لغات العمرانية الجيزة Mathematics 40.98 Pass 52.50 52.0 50.0 31.666667 ... 23.333333 168.0 16.0 14.0 33.0 M False 23.333333 55.0 41.119048
158074 ليسية الهرم الخاصة لغات العمرانية الجيزة Mathematics 42.20 Pass 50.00 64.0 50.0 21.666667 ... 26.666667 173.0 20.0 14.0 25.0 M False 21.666667 64.0 42.714286
158205 محرم الاسلامية الخاصة لغات العمرانية الجيزة Science 38.54 Pass 30.00 50.0 57.5 NaN ... 50.000000 158.0 18.0 16.0 25.0 M False 26.666667 57.5 40.119048
158224 محرم الاسلامية الخاصة لغات العمرانية الجيزة Mathematics 39.76 Pass 20.00 50.0 50.0 50.000000 ... 50.000000 163.0 16.0 14.0 25.0 M False 20.000000 50.0 41.428571
158225 محرم الاسلامية الخاصة لغات العمرانية الجيزة Mathematics 47.32 Pass 50.00 68.0 85.0 50.000000 ... 30.000000 194.0 19.0 13.0 25.0 M False 30.000000 85.0 49.476190
158253 الواحة الخاصة لغات العمرانية الجيزة Science 47.32 Pass 51.25 60.0 77.5 NaN ... 26.666667 194.0 20.0 13.0 26.0 F False 26.666667 77.5 48.869048
158262 ابن عطاء الله الخاصة لغات العمرانية الجيزة Humanities 45.37 Pass 56.25 54.0 70.0 NaN ... NaN 186.0 15.0 17.0 25.0 M False 23.333333 70.0 46.226190
158271 ابن عطاء الله الخاصة لغات العمرانية الجيزة Science 49.51 Pass 56.25 68.0 50.0 NaN ... 50.000000 203.0 21.0 19.0 29.0 F False 18.333333 68.0 49.654762

8 rows × 28 columns

We find this unexpected behavior in the data. There are 8 students who scored below 50% in at least one subject, and below 50% overall, but are marked as Passed. This is probably due to a mistake in the data entry. We will remove these students.

df = df[~((df['Min Grade'] < 50) & (df['Status'] == 'Pass'))]
df[(df['Min Grade'] < 50) & (df['Status'] == 'Pass')]
School Name Administration City Branch Percentage Status Arabic First Foreign Language Second Foreign Language Pure Mathematics ... Physics Total Religion National Education Economics Statistics Gender Has Missing Subject Min Grade Max Grade Mean Grade

0 rows × 28 columns

How are the grades distributed?

Most well-known standardized tests have approximately normally distributed scores. This means that most students score around the average, and the further you go from the average, the fewer students you find. A normal distribution is generally seen as a good sign, since it suggests that the test is fair. Some institutions have a policy of adjusting the grades to make them normally distributed. This is usually done by curving the grades.

One example of such a test is the SAT. The SAT is a standardized test used for college admission in the United States and by American Diploma students in other countries. The SAT is designed to have a normal distribution. The College Board, the organization that administers the SAT, doesn’t grade on a curve. This means your score isn’t affected by how other students perform. Yet, the SAT remains standardized, and scores from one year are comparable to scores from another year. The College Board succeeded in making the score distribution normal by designing the test to be fair and accurate.

Let’s see if the Sanawya Amma grades are normally distributed.

fig = px.histogram(df, x='Percentage', title='Students Grades Distribution')

fig.add_vline(x=50, line_width=2, line_dash="dash", line_color="red")
fig.add_vline(x=df['Percentage'].median(), line_width=2, line_dash="dash", line_color="red")

fig.update_layout(
    annotations=[
          dict(x=50, y=5000, xref="x", yref="y", text="Passing Grade",
               ax=0, ay=0, font=dict(size=14), bgcolor='#eeeeee'),
          dict(x=df['Percentage'].median(), y=5500, xref="x", yref="y", text=f"Median Grade {df['Percentage'].median():.2f}",
               ax=0, ay=0, font=dict(size=14), bgcolor='#eeeeee'),
     ],
     xaxis_title='Grade out of 100',
     yaxis_title='Number of Students',
)

fig.show()

We see in the histogram that the grades are approximately normally distributed. This is a good sign that the test is fair. The grades, however, are not centered around the passing grade of 50% as we would expect. The median grade is around 63%. This means that more students score above the passing grade than below it, which indicates that many more students pass the exam than fail it. We can see this in the pie chart below.

px.pie(df, names='Status', title='Students Status Distribution')

The pie chart shows that the number of passing students is double the number of failing students.
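Judging normality from a histogram is a visual impression. One rough way to complement it is to look at skewness and excess kurtosis, both of which are close to 0 for a normal distribution. The sketch below uses synthetic grades generated with NumPy, purely for illustration; on the real data, `grades` would be replaced by `df['Percentage']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, normally distributed grades for illustration only (not the real results)
grades = pd.Series(rng.normal(loc=63, scale=15, size=100_000))

# Skewness measures asymmetry; excess kurtosis measures tail weight
print('skewness:', round(grades.skew(), 3))
print('excess kurtosis:', round(grades.kurt(), 3))
```

Values far from 0 would suggest that the distribution is noticeably asymmetric or heavy-tailed, even if the histogram looks bell-shaped at first glance.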

Are all branches equal?

Let’s see how the grades are distributed within each branch.

Note this graph is interactive. You can hover over the bars to see the exact number of students, and you can click on the legend to hide/show a branch.

colors = ['#3b82f6', '#22c55e', '#ef4444']
fig = px.histogram(df, x='Percentage', color='Branch', title='Students Grade Distribution by Branch', 
                   barmode='overlay', histnorm='percent', color_discrete_sequence=colors)

for (i, branch) in enumerate(df['Branch'].unique()):
  branch_students = df[df['Branch'] == branch]
  # Mark the branch median with a dashed vertical line in the branch's color
  fig.add_vline(x=branch_students['Percentage'].median(), line_width=2, line_dash="dash", line_color=colors[i],
                name=f'{branch} Median')
  
fig.update_layout(
    annotations=[
          dict(x=80, y=1.1, xref="x", yref="y", text="Median Grade for each Branch",
               ax=0, ay=0, font=dict(size=14), bgcolor='#eeeeee'),
     ],
     legend_title_text='Branch',
     xaxis_title='Grade out of 100',
     yaxis_title='Percentage of Students (%)'
)

fig.show()

As we can see from the histogram, the grade distributions for the Science and Mathematics branches are very similar. However, the Humanities branch has a much lower average grade: its median is around 60%, while for the other branches it is around 65%. Moreover, the share of students scoring above 90% is much lower in the Humanities branch than in the other branches. This suggests one of two things:
1. The Humanities branch is harder than the other branches.
2. The students in the Humanities branch are less prepared than those in the other branches.

With the data we have, we can’t determine which of the two is true. However, we can see that the Humanities branch has a lower average grade.
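The per-branch medians read off the histogram can also be computed directly with a `groupby`. The sketch below runs on a tiny toy frame with made-up grades, so its numbers are not the real results; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy data with made-up grades, for illustration only
toy = pd.DataFrame({
    'Branch': ['Humanities', 'Humanities', 'Humanities',
               'Science', 'Science', 'Mathematics', 'Mathematics'],
    'Percentage': [40.0, 60.0, 80.0, 50.0, 70.0, 55.0, 75.0],
})

# Median, mean, and number of students per branch
summary = toy.groupby('Branch')['Percentage'].agg(['median', 'mean', 'count'])
print(summary)
```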

How common is it to succeed in total yet fail in a subject?

Why do failing students fail?

A student is considered to have passed the exam if he scored at least 50% in every subject. Let’s see how common it is for a student to have a total grade above 50% yet fail in a subject.

fig = px.histogram(df, x='Percentage', color='Status', title='Students Grade Distribution by Status',
              barmode='overlay')

fig.update_layout(
  xaxis_title='Grade out of 100',
  yaxis_title='Number of Students',
)

fig.show()

We make a very interesting observation. The grade distribution for failing students is approximately normal and centered around 50%.

There is a common misconception that failing students are those who score very low in all subjects. However, this is not true. The total grades of failing students are roughly normally distributed around 50%, which means that many failing students have an overall grade close to, or even above, the passing mark.

We can explore this further by seeing how common it is for a student to have a total grade above 50% yet fail in a subject.

# Copy the slice so that new columns can be added without a SettingWithCopyWarning
failings = df[df['Status'] == 'Fail'].copy()
failings['Is Above 50'] = failings['Percentage'] >= 50

fig = px.pie(failings, names='Is Above 50', title='Failings Distribution Above and Below 50%')

fig.show()

That’s astonishing to see. Around 44% of the students who failed the exam had a total grade above 50%. This means that almost half of the students who failed scored above the passing grade overall, yet fell below it in at least one subject.

It means that failing students are not necessarily those who score very low in all subjects. Instead, many of them do reasonably well overall but fall short in one or more subjects. This is a very important insight for students preparing for the exam: they should give every subject enough attention to pass it. They shouldn’t ignore a subject just because they don’t like it or because they think they are bad at it. If you are so good in one subject yet horrible in another, you are likely to fail.
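A small worked example makes the point concrete. The grades below are hypothetical (not taken from the dataset) and are already scaled to be out of 100; the student does well on average but still fails because of a single subject.

```python
import pandas as pd

# Hypothetical scaled grades (out of 100) for one Humanities student
grades = pd.Series({
    'Arabic': 85, 'First Foreign Language': 90, 'Second Foreign Language': 80,
    'History': 75, 'Geography': 40, 'Philosophy': 70, 'Psychology': 65,
})

mean_grade = grades.mean()
# Subjects scored below the 50% passing mark
failed_subjects = grades[grades < 50].index.tolist()
status = 'Pass' if not failed_subjects else 'Fail'

print(f"mean={mean_grade:.1f}, failed={failed_subjects}, status={status}")
```

Note that the mean of the scaled grades is only a proxy for the official Percentage, because subjects carry different maximum grades in the total.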

failings['Min Mean Difference'] = failings['Mean Grade'] - failings['Min Grade']

fig = px.histogram(failings, x='Min Mean Difference', title='Failings Students Average Grade - Min Grade Distribution')

fig.add_vline(x=failings['Min Mean Difference'].mean(), line_width=2, line_dash="dash", line_color="red",
              annotation_text=f"Mean {failings['Min Mean Difference'].mean():.2f}")

fig.update_layout(
  xaxis_title='Average Grade - Min Grade for failing students',
  yaxis_title='Number of Students',
)

fig.show()

This graph supports the same point. On average, a failing student’s minimum grade is 21.48 percentage points below his average grade. This is a very large difference. It is consistent with failing students neglecting one subject while focusing on others. This is important advice for students preparing for the exam: Don’t ignore any subject.

Are the Compassion Grades a myth?

There is a common belief among students that the ministry gives compassion grades to students who are close to passing. This means that if a student is very close to passing, the ministry will give him a few extra points to pass.

When we plotted the grades distribution, we saw that the total grades are approximately normally distributed. If the ministry were giving compassion grades on the total, the distribution wouldn’t look like that. Instead, we would see a spike just above the passing grade. This is not the case.

However, any compassion marks would be added at the subject level. Hence, the myth might be true even though the total grades are normally distributed. Let’s start exploring this by checking the distribution of the grades for the Arabic subject.

fig = px.histogram(df, x='Arabic', title='Arabic Grades Distribution')

fig.show()

WOW!

The myth seems true. The grades for the Arabic subject are not normally distributed. Instead, there is a HUGE spike right at the passing grade, while the number of students just below it drops to 0. This is a clear indication that the ministry is giving compassion grades, but before drawing a conclusion, let’s quantify this phenomenon and check other subjects.

For each subject, we will calculate the percentage of students who failed within 2.5% of the passing grade and the percentage of students who passed within 2.5% of the passing grade. We will then compare these percentages to see if the ministry is giving compassion grades.

If the ministry is giving compassion grades, we would expect to see a much higher percentage of students passing within 2.5% of the passing grade than the percentage of students failing within 2.5% of the passing grade.

def get_barely_passing_students_percentage(df, subject, margin=2.5, threshold=50):
  return len(df[(df[subject] >= threshold) & (df[subject] <= threshold + margin)]) / len(df[df[subject] >= threshold]) * 100

def get_barely_failing_students_percentage(df, subject, margin=2.5, threshold=50):
  return len(df[(df[subject] < threshold) & (df[subject] >= threshold - margin)]) / len(df[df[subject] < threshold]) * 100

barely_passed_perc = [get_barely_passing_students_percentage(df, subject) for subject in subjects]
barely_failed_perc = [get_barely_failing_students_percentage(df, subject) for subject in subjects]

barely_passed_df = pd.DataFrame({'Subject': subjects, 'Barely Passing Percentage': barely_passed_perc, 
                                 'Barely Failing Percentage': barely_failed_perc})

fig = px.bar(barely_passed_df, x='Subject', y=['Barely Passing Percentage', 'Barely Failing Percentage'], barmode='group',
              title='Barely Passing and Failing Students Percentage by Subject', labels={'value': 'Percentage', 'variable': 'Status'})

fig.update_layout(
  yaxis_title='Percentage of Students (%)',
  xaxis_title='Subject',
)

fig.show()

The difference is so big that in the above bar chart we can’t even see the percentage of students who failed within 2.5% of the passing grade. So, let’s print the exact percentages to see the difference.

barely_passed_df
    Subject                  Barely Passing Percentage  Barely Failing Percentage
0   Arabic                   24.907257                  0.000000
1   First Foreign Language   20.700964                  0.004478
2   Second Foreign Language  18.067144                  0.004093
3   Pure Mathematics         28.560174                  0.008562
4   History                  46.203238                  0.000000
5   Geography                43.848881                  0.000000
6   Philosophy               15.973283                  0.000000
7   Psychology               16.007772                  0.000000
8   Chemistry                24.637011                  0.000000
9   Biology                  34.855668                  0.000000
10  Geology                  9.788448                   0.000000
11  Applied Math             24.087328                  0.000000
12  Physics                  38.149330                  0.000000

The difference is incredible. For 10 of the 13 subjects, no student failed within 2.5% of the passing grade, and in the remaining three, fewer than 0.01% of the failing students did. This is a clear indication that the ministry is giving compassion grades. The myth is true.
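
The two percentages above use different denominators (all passing students versus all failing students), so it can also help to look at the raw number of students on each side of the threshold. The sketch below is only an illustration, assuming grades are stored as percentages with a passing grade of 50. The helper `count_near_threshold` and the toy grades are hypothetical and are not part of the original notebook or dataset.

```python
import pandas as pd

def count_near_threshold(grades, threshold=50, margin=2.5):
    """Count grades in the bands just below and just above the threshold."""
    grades = grades.dropna()
    # Students who failed by at most `margin` points
    just_below = ((grades >= threshold - margin) & (grades < threshold)).sum()
    # Students who passed by at most `margin` points
    just_above = ((grades >= threshold) & (grades <= threshold + margin)).sum()
    return int(just_below), int(just_above)

# Made-up grades for illustration only (not taken from the dataset)
toy = pd.DataFrame({'Arabic': [30, 45, 50, 50, 51, 52, 70, 90, None]})
below, above = count_near_threshold(toy['Arabic'])
print(f'just below: {below}, just above: {above}')
```

On the real data, the same helper could be applied to each column in `subjects`. A large gap between the two counts would point in the same direction as the percentages.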

Does city affect the grades?

An interesting question to ask is whether the city affects the grades. Are students from some cities more likely to score higher than students from other cities?

To make this analysis readable to a wider audience, I will begin by translating the city names to English.

city_english_names_map = {
  'القاهرة': 'Cairo',
  'الإسكندرية': 'Alexandria',
  'الجيزة': 'Giza',
  'الدقهلية': 'Dakahlia',
  'البحيرة': 'Beheira',
  'المنوفية': 'Monufia',
  'الشرقية': 'Sharqia',
  'الغربية': 'Gharbia',
  'الفيوم': 'Fayoum',
  'القليوبية': 'Qalyubia',
  'المنيا': 'Minya',
  'الاقصر': 'Luxor',
  'البحر الأحمر': 'Red Sea',
  'الوادي الجديد': 'New Valley',
  'السويس': 'Suez',
  'الاسكندرية': 'Alexandria',
  'الإسماعيلية': 'Ismailia',
  'اسوان': 'Aswan',
  'اسيوط': 'Assiut',
  'بني سويف': 'Beni Suef',
  'بورسعيد': 'Port Said',
  'جنوب سيناء': 'South Sinai',
  'دمياط': 'Damietta',
  'سوهاج': 'Sohag',
  'قنا': 'Qena',
  'كفر الشيخ': 'Kafr El Sheikh',
  'مطروح': 'Matrouh',
  'شمال سيناء': 'North Sinai'
}

df['City'] = df['City'].map(city_english_names_map)
df['City'].unique()
array(['Giza', 'Beni Suef', 'Fayoum', 'Cairo', 'Ismailia', 'Port Said',
       'Monufia', 'Gharbia', 'Alexandria', 'Sharqia', 'Assiut',
       'Damietta', 'Sohag', 'Dakahlia', 'Suez', 'Qalyubia', 'Beheira',
       'Kafr El Sheikh', 'Qena', 'Aswan', 'Matrouh', 'Minya',
       'North Sinai', 'Red Sea', 'South Sinai', 'Luxor', 'New Valley'],
      dtype=object)

Now let’s see how the grades are distributed among the cities.

cities = df['City'].unique()
student_count = [len(df[df['City'] == city]) for city in cities]
average_grade = [df[df['City'] == city]['Percentage'].mean() for city in cities]

cities_df = pd.DataFrame({'City': cities, 'Student Count': student_count, 'Average Grade': average_grade})
cities_df['Average Grade'] = cities_df['Average Grade'].round(1)

fig = px.treemap(
  cities_df,
  path=['City'],
  values='Average Grade',
  color='Average Grade',
  color_continuous_scale='Blues',
  title='Grades Distribution over Cities'
)

fig.show()

From the above tree map, we see that some cities have a much higher average grade than others. For example, students from North Sinai have an average grade of 74.6%, while students from Minya have an average grade of 55.5%. This is a very large difference.

Now, I would like to know if the city’s average grade is related to the number of students from the city and the city’s geographical location.

fig = px.treemap(
  cities_df,
  path=['City'],
  values='Student Count',
  color='Average Grade',
  color_continuous_scale='Blues',
  title='Grades Distribution over Cities by Student Count'
)

fig.show()

We notice a weak negative correlation between the city’s average grade and the number of students from the city. This means that cities with fewer students tend to have higher average grades; however, the correlation isn’t very strong.
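
A treemap only suggests this relationship visually. One way to put a number on it is the Pearson correlation coefficient between the two columns. The sketch below assumes a DataFrame shaped like `cities_df`, with `Student Count` and `Average Grade` columns. The function name `count_grade_correlation` and the toy values are hypothetical and made up for illustration.

```python
import pandas as pd

def count_grade_correlation(cities_df):
    """Pearson correlation between student count and average grade."""
    return cities_df['Student Count'].corr(cities_df['Average Grade'])

# Made-up values for illustration only (not taken from the dataset)
toy_cities_df = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Student Count': [500, 1500, 3000, 6000],
    'Average Grade': [72.0, 65.0, 68.0, 60.0],
})

r = count_grade_correlation(toy_cities_df)
print(f'Pearson r = {r:.2f}')
```

A value close to 0 would indicate a weak relationship, and a negative sign would mean that larger cities tend to have lower averages.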

city_coordinates_map = {
  'Cairo': (30.033333, 31.233334),
  'Alexandria': (31.200092, 29.918739),
  'Giza': (30.013056, 31.208853),
  'Dakahlia': (31.034431, 31.380691),
  'Beheira': (30.469561, 30.931739),
  'Monufia': (30.464713, 31.18422),
  'Sharqia': (30.587676, 31.501218),
  'Gharbia': (30.793408, 31.012645),
  'Fayoum': (29.3084, 30.8416),
  'Qalyubia': (30.328454, 31.243225),
  'Minya': (28.1187, 30.7416),
  'Luxor': (25.6872, 32.6396),
  'Red Sea': (26.7153, 33.9368),
  'New Valley': (25.6773, 28.8955),
  'Suez': (30.0075, 32.5498),
  'Ismailia': (30.5903, 32.2653),
  'Aswan': (24.088938, 32.899829),
  'Assiut': (27.1828, 31.0014),
  'Beni Suef': (29.0734, 31.0994),
  'Port Said': (31.2565, 32.2841),
  'South Sinai': (28.4177, 33.0451),
  'Damietta': (31.4165, 31.8133),
  'Sohag': (26.5569, 31.6948),
  'Qena': (26.1553, 32.7163),
  'Kafr El Sheikh': (31.1093, 30.9367),
  'Matrouh': (31.3549, 27.2373),
  'North Sinai': (31.2156, 33.3581)
}

cities_df['Latitude'] = cities_df['City'].map(lambda city: city_coordinates_map[city][0])
cities_df['Longitude'] = cities_df['City'].map(lambda city: city_coordinates_map[city][1])

# px.scatter_map replaces the deprecated px.scatter_mapbox (Plotly 5.24 or newer)
fig = px.scatter_map(
  cities_df,
  lat='Latitude',
  lon='Longitude',
  color='Average Grade',
  size='Student Count',
  hover_name='City',
  zoom=4,
  title='Grades Distribution over cities on Egypt Map',
  labels={'Average Grade': 'Average Percentage'},
  color_continuous_scale='tealrose'
)

fig.update_layout(map_style='open-street-map')

fig.show()

From the map, it seems that northern cities tend to have higher average grades than southern cities.
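
A rough numeric check of this impression is to split the cities at the median latitude and compare the mean of the city averages in each half. The sketch below assumes a DataFrame shaped like `cities_df`, with `Latitude` and `Average Grade` columns. The function name `north_south_means` and the toy values are hypothetical, and the comparison is unweighted, so each city counts equally regardless of its number of students.

```python
import pandas as pd

def north_south_means(cities_df):
    """Mean of city average grades, split at the median latitude."""
    median_lat = cities_df['Latitude'].median()
    # Cities strictly above the median latitude are labelled 'North'
    region = (
        cities_df['Latitude']
        .gt(median_lat)
        .map({True: 'North', False: 'South'})
        .rename('Region')
    )
    return cities_df.groupby(region)['Average Grade'].mean()

# Made-up values for illustration only (not taken from the dataset)
toy_cities_df = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Latitude': [31.2, 30.0, 26.5, 24.1],
    'Average Grade': [70.0, 66.0, 60.0, 58.0],
})

means = north_south_means(toy_cities_df)
print(means)
```

Splitting at the median is only one possible choice. A correlation between latitude and average grade would be another way to test the same idea.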

Do all subjects have the same difficulty?

It is no secret that students tend to find some subjects easier than others, but what matters here is which subjects are perceived as difficult by most students. To figure this out, we will plot a bar chart of the mean grade of each subject.

subject_avg_grades = [df[subject].mean() for subject in subjects]
subject_avg_grades_df = pd.DataFrame({'Subject': subjects, 'Average Grade': subject_avg_grades})
subject_avg_grades_df = subject_avg_grades_df.sort_values('Average Grade', ascending=False)

fig = px.bar(subject_avg_grades_df, x='Subject', y='Average Grade', title='Average Grades by Subject')
fig.show()

From this graph, we can see that the subjects are not equally difficult. The subjects that are considered the most difficult are:

- History
- Geography

The subjects that are considered the easiest are:

- Second Foreign Language (French or German)
- Geology

Conclusion

The Sanawya Amma exam is a major turning point in the life of any Egyptian student. In this notebook, we uncovered some interesting insights about the exam. I will summarize the most important insights here:

- The grades are normally distributed, which is a good sign that the exam is fair.
- The Humanities branch has a lower average grade than the other branches.
- Failing students tend to score around the average in all subjects. This means students shouldn’t ignore any subject.
- The ministry is giving compassion grades. The myth is true.
- The city affects the grades. Northern cities tend to have higher average grades than southern cities.

If you have finished the exam, I hope you found this analysis insightful. If you are a non-Egyptian, I hope this notebook gave you a glimpse into the Egyptian education system. If you are a student preparing for the exam, I hope you found these insights useful. Remember, the exam is fair, and you should focus on all subjects equally. Don’t ignore any subject. Good luck!

Future Work

Starting from the 2024-2025 academic year, the Egyptian education system adopted a new system for the Sanawya Amma exam. The new system has different subjects, a different structure, and a different education style. I am happy that the Egyptian education system is taking these steps to improve itself and maintain the high educational standards in Egypt. I am optimistic that the upcoming change will lead the nation towards a brighter future and better education for all.

Based on my personal knowledge of educational systems, I suggest that the new system allow students multiple attempts. This would relieve students of the enormous pressure and stress imposed on them by the exam. Multiple attempts give students breathing space and allow them to perform better and show their true potential. They also offset the unfairness caused by a bad day and give students a second chance to prove themselves and show their true skill and knowledge. Other international tests like the SAT and the IGCSE allow multiple attempts, and this has proven to be very successful and beneficial for the students.

I hope to analyze the new system in the future and compare it to the old system. This will allow us to see the impact of the new system on the students’ grades and the education system in Egypt.