Application of Linear Regression with R – Predicting Final Grades based on GPA

In the problem below, our objective is to find out whether there is a cutoff GPA such that students whose GPA falls below it tend to fail the class. Such a cutoff would let us identify at-risk students early in the course. Since the GPA is collected at the beginning of the course and the final grades are obtained at the end, the GPA should be the predictor variable, even though the quantity we ultimately want is a cutoff GPA. Therefore, the model we will fit is

final grade ~ GPA

Let us assume that the final grade is linearly related to the students’ GPA. We can check for linearity in a later section. 

> df = read.csv("gpa.csv")

The imported data looks like this. Names are withheld.

Student CurrentGPA TotalAbsent NumericGrade
***ia 3 15.04 89
***lheid 3.67 21.52 83
***ue 1.2 31.63 17
***tha 2.5 24.56 71
***raina 2 54.04 15
***ia 2.8 13.93 72
***issy 3.33 24.33 79
***ob 2.78 8.15 80
***uel 2.67 10.56 73
***hael 3.1 8.26 79

If we draw a box plot of the final grades to see their quartiles

> boxplot(df$NumericGrade)

it would look like this

[Figure a1: box plot of NumericGrade]

and if we plot GPA vs FinalGrade it will look like this

[Figure a2: scatter plot of GPA vs FinalGrade]
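The quartiles drawn by the box plot of the final grades can also be printed as numbers. The sketch below is a minimal example that types in only the ten NumericGrade values listed in the table above, not the full gpa.csv, so its results will differ from the figure. The variable names grades, five and bp are just for illustration.

```r
# Only the ten NumericGrade values shown in the table above (assumption:
# the full gpa.csv has more rows, so its quartiles will differ).
grades <- c(89, 83, 17, 71, 15, 72, 79, 80, 73, 79)

# Tukey's five-number summary: minimum, lower hinge, median,
# upper hinge, maximum.
five <- fivenum(grades)
print(five)

# boxplot.stats() returns what boxplot() draws: whisker ends and hinges
# in $stats, and points lying beyond 1.5 times the hinge spread in $out.
bp <- boxplot.stats(grades)
print(bp$stats)
print(bp$out)
```

Values listed in bp$out are the ones boxplot() would draw as separate points beyond the whiskers.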

We fit the linear model

> gpafit = lm(NumericGrade ~ CurrentGPA, data=df)

Ideally, we would split the data into training and test sets, fit the model on the training data, and check its error on the test data, repeating until the error is acceptably low. However, we are going for efficiency here and not accuracy. We plot the fitted line (red line) on the scatter plot of GPA vs FinalGrade.

> abline(gpafit, col="red")

[Figure a4: scatter plot of GPA vs FinalGrade with the fitted line in red]
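The test-train split mentioned above could look roughly like the following. This is only a sketch under stated assumptions: it rebuilds a small data frame from the ten rows shown in the table (the real gpa.csv is larger), and the 7/3 split, the seed, and the names df10, train, test and rmse are illustrative choices, not part of the original analysis.

```r
# Ten rows copied from the table above (assumption: stand-in for gpa.csv).
df10 <- data.frame(
  CurrentGPA   = c(3, 3.67, 1.2, 2.5, 2, 2.8, 3.33, 2.78, 2.67, 3.1),
  NumericGrade = c(89, 83, 17, 71, 15, 72, 79, 80, 73, 79)
)

set.seed(1)                                  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(df10), size = 7)    # 7 rows for training, 3 held out
train <- df10[train_idx, ]
test  <- df10[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit_train <- lm(NumericGrade ~ CurrentGPA, data = train)
pred <- predict(fit_train, newdata = test)

# Root-mean-square error on the held-out rows.
rmse <- sqrt(mean((test$NumericGrade - pred)^2))
print(rmse)
```

With so few rows the error estimate is very noisy. On the full data set a larger held-out share, or repeated splits, would give a steadier picture.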

We want to set the passing grade at 70 and find the GPA that corresponds to a predicted final grade of 70. However, we have obtained FinalGrade as a function of GPA and not the other way around. To predict the GPA (x-value) for a given FinalGrade (y-value), we can use the approx() function, interpolating over the fitted values.

> approx(gpafit$fitted.values, df$CurrentGPA, 70)
$x
[1] 70

$y
[1] 2.680097

From the output we see that when x=70, y=2.68, where x is now the FinalGrade and y is the GPA. approx() takes a vector of x-values and the corresponding y-values and linearly interpolates at x_pred to get y_pred, i.e. y_pred = approx(x, y, x_pred)$y. We draw two lines, one at FinalGrade = 70 and one at GPA = 2.68, to show our cutoffs.

> plot(df$CurrentGPA, df$NumericGrade, col="blue", 
      xlab = "GPA", ylab="FinalGrade")
> abline(gpafit, col="red")
> abline(v=2.68)
> abline(h=70)

[Figure a5: scatter plot with the fitted line and the cutoff lines at GPA = 2.68 and FinalGrade = 70]

It looks like students with a GPA>2.68 tend to pass the course. Some students passed the course in spite of having a GPA<2.68, but all students with a GPA>2.68 passed except one. We can even identify that student with Python-like slicing of the data frame.

> df[df$NumericGrade<70 & df$CurrentGPA>2.68,]
   Student CurrentGPA TotalAbsent NumericGrade
59 ***iola          3       28.04           46
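As a cross-check on the approx() step used earlier: because the fitted values lie exactly on the regression line, the same cutoff can be read off the coefficients by solving grade = intercept + slope * GPA for GPA. The sketch below compares the two routes on the ten rows shown in the table only (an assumption made to keep the example self-contained, so its cutoff will not equal the 2.68 obtained from the full data). The names df10, fit10 and pass_mark are illustrative.

```r
# Ten rows copied from the table above (assumption: stand-in for gpa.csv).
df10 <- data.frame(
  CurrentGPA   = c(3, 3.67, 1.2, 2.5, 2, 2.8, 3.33, 2.78, 2.67, 3.1),
  NumericGrade = c(89, 83, 17, 71, 15, 72, 79, 80, 73, 79)
)
fit10 <- lm(NumericGrade ~ CurrentGPA, data = df10)
pass_mark <- 70

# Invert the fitted line: grade = b0 + b1 * GPA  =>  GPA = (grade - b0) / b1
b <- coef(fit10)
cutoff_line <- unname((pass_mark - b[1]) / b[2])

# Same cutoff by interpolating over the fitted values, as done with approx().
cutoff_approx <- approx(fitted(fit10), df10$CurrentGPA, xout = pass_mark)$y

print(c(line = cutoff_line, approx = cutoff_approx))
```

The two numbers agree up to rounding. One practical difference: with its default rule = 1, approx() returns NA when the pass mark lies outside the range of the fitted values, whereas the formula extrapolates along the line.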

We will run the regression analysis with all the predictor variables including student absenteeism in a later post.
