Predicting Stock Price Volatility with GARCH Model (Part 1)

In time series analysis, it is essential to model the volatility of a stock. One way to achieve this is through the use of the EGARCH (Exponential Generalized Autoregressive Conditional Heteroskedasticity) model. In this article, we will perform an analysis of the MSFT stock price using EGARCH to model its volatility.

GARCH Model and When to Use It

GARCH (Generalized Autoregressive Conditional Heteroskedasticity) is a statistical model used to analyze financial time series data. It is a type of ARCH (Autoregressive Conditional Heteroskedasticity) model that takes into account the volatility clustering often observed in financial data. The GARCH model assumes that the variance of the error term in a time series is a function of both past error terms and past variances.

The GARCH model is commonly used in finance to model and forecast the volatility of asset returns. In particular, it is useful for predicting the likelihood of extreme events, such as a sudden stock market crash or a sharp increase in volatility.

When deciding whether to use a GARCH model, it is important to consider the characteristics of the financial time series data being analyzed. If the data exhibits volatility clustering or other patterns of heteroskedasticity, a GARCH model may be appropriate. Additionally, GARCH models are often used when the goal is to forecast future volatility or to estimate the risk associated with an investment. The GARCH(p,q) model can be represented by the following equation:

\begin{aligned} r_t &= \mu_t + \epsilon_t \\ \epsilon_t &= \sigma_t z_t \\ \sigma_t^2 &= \omega + \sum_{i=1}^p \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^q \beta_j \sigma_{t-j}^2 \end{aligned}

where r_t is the log return at time t, \mu_t is the conditional mean at time t, \epsilon_t is the standardized residual at time t, \sigma_t is the conditional standard deviation at time t, z_t is a standard normal random variable, \omega is the constant, \alpha_i and \beta_i are the GARCH and ARCH coefficients at lag i, and p and q are the order of the GARCH and ARCH terms, respectively.

In a GARCH(p,q) model, the dependence on the error term and the volatility term at the same time reflects the notion of volatility clustering, which is a characteristic of financial time series data. The error term represents the current shock or innovation to the return series, while the volatility term captures the past history of the shocks. The dependence on the error term and the volatility term at the same time implies that the model recognizes that a current shock to the return series can have a persistent effect on future volatility. In other words, large shocks tend to be followed by large subsequent changes in volatility, and vice versa. This feature of GARCH models has important implications for risk management and financial decision-making. By accounting for the clustering of volatility, GARCH models can provide more accurate estimates of risk measures, such as Value-at-Risk (VaR) and Expected Shortfall (ES), which are used to assess the potential losses in financial portfolios. GARCH models can also be used to forecast future volatility, which can be useful for developing trading strategies and hedging positions in financial markets. We will explore these concepts in the future parts of this ongoing series.

One specific form of the GARCH model is the EGARCH model, which stands for Exponential Generalized Autoregressive Conditional Heteroskedasticity. The EGARCH model allows for both asymmetry and leverage effects in the volatility of the data. The EGARCH model can be represented by the following equation:

\begin{aligned} r_t &= \mu_t + \epsilon_t \\ \epsilon_t &= \sigma_t z_t \\ \log(\sigma_t^2) &= \omega + \sum_{i=1}^p \alpha_i \left( \frac{\left| \epsilon_{t-i} \right|}{\sigma_{t-i}} - \sqrt{\frac{2}{\pi}} \right) + \sum_{i=1}^q \beta_i \log \sigma_{t-i}^2 \end{aligned}

where r_t is the log return at time t, \mu_t is the conditional mean at time t, \epsilon_t is the standardized residual at time t, \sigma_t is the conditional standard deviation at time t, z_t is a standard normal random variable, \omega is the constant, \alpha_i and \beta_i are the GARCH and ARCH coefficients at lag i, and p and q are the order of the GARCH and ARCH terms, respectively.

Exploratory Data Analysis

Before modeling, it is essential to explore the data to understand its characteristics. The plot below shows the time series plot of the MSFT stock price.

import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import statsmodels.api as sm
import pmdarima as pm
import yfinance as yf
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

msft = yf.Ticker('MSFT')
df = msft.history(period='5y')

sns.light_palette("seagreen", as_cmap=True)
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
sns.lineplot(df['Close'])
plt.title('MSFT')

We can see that the stock price exhibits a clear upward trend over the period, with some fluctuations. The plot below shows the ACF and PACF plots of the first differences of the stock price.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF plots of first differences
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(msft_log, ax=axes[0])
plot_pacf(msft_log, ax=axes[1])
plt.tight_layout()
plt.show()

From the ACF and PACF plots, we can observe that there is no clear pattern in the data, indicating that it may be a white noise process. However, there is some significant autocorrelation at lag 1 in the PACF plot, suggesting that we may need to include an AR term in our model.

Model Selection

To model the volatility of the MSFT stock price, we will use the EGARCH model. We will begin by fitting a baseline EGARCH(1,1) model and compare it with other models.

from arch import arch_model

# Fit EGARCH(1,1) model
egarch11_model = arch_model(msft_log, vol='EGARCH',
                               p=1, o=0, q=1, dist='Normal')
egarch11_fit = egarch11_model.fit()
print(egarch11_fit.summary())


Constant Mean - GARCH Model Results                      
===============================================================
Dep. Variable:                  Close   R-squared:                       0.000
Mean Model:             Constant Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:                3350.57
Distribution:                  Normal   AIC:                          -6693.15
Method:            Maximum Likelihood   BIC:                          -6672.60
                                        No. Observations:                 1258
Date:                Tue, May 09 2023   Df Residuals:                     1257
Time:                        10:42:42   Df Model:                            1
                                 Mean Model                                 
===============================================================
                 coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
mu         1.5045e-03  9.713e-08  1.549e+04      0.000 [1.504e-03,1.505e-03]
                              Volatility Model                              
===============================================================
                 coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
omega      7.6495e-06  1.791e-12  4.272e+06      0.000 [7.650e-06,7.650e-06]
alpha[1]       0.1000  1.805e-02      5.541  3.004e-08   [6.463e-02,  0.135]
beta[1]        0.8800  1.551e-02     56.729      0.000     [  0.850,  0.910]
===============================================================

The following table shows the results of fitting various EGARCH models to the MSFT stock price data.

Overall, the models indicate that the volatility of stock returns is persistent, with all models showing significant positive values for the alpha parameters. Moreover, the models suggest that the volatility of stock returns responds asymmetrically to changes in returns, with negative shocks having a more significant impact than positive shocks. This is highlighted by the negative values of the omega parameters in all three models. In finance, the omega parameter represents the risk in the market that is unrelated to the past volatility of the asset being studied. It signifies the inherent uncertainty or randomness in the system that cannot be explained by any of the past information used in the model.

ModelLog LikelihoodAICBIC
EGARCH(1,1)3355.44-6702.88-6682.33
EGARCH(1,2)3356.18-6702.36-6676.67
EGARCH(2,1)3356.67-6703.34-6677.66
EGARCH(2,2)3356.67-6701.34-6670.52

Based on the information criteria, the EGARCH(2,2) model has the lowest AIC and BIC values, making it the final model of choice.

Model Diagnostics

After selecting the final model, we need to perform diagnostic checks to ensure that the model is appropriate. The following plots show the diagnostic checks for the EGARCH(2,2) model.

# Residuals plot
plt.plot(egarch22_fit.resid)
plt.title("EGARCH(2,2) Residuals")
plt.show()
# ACF/PACF of residuals
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(egarch22_fit.resid, ax=axes[0])
plot_pacf(egarch22_fit.resid, ax=axes[1])
plt.tight_layout()
plt.show()

From the residual plot, we can see that the residuals of the model are approximately normally distributed and have constant variance over time. Additionally, the ACF and PACF plots of the residuals show no significant autocorrelation, indicating that the model has captured all the relevant information in the data.

Forecasting

The EGARCH(2,2) model provides a volatility fit for the MSFT stock price. Notably, there were spikes in volatility around the start of COVID in 2020 and during the Fed’s interest rate increase in 2022.

# Plot conditional volatility
plt.plot(egarch22_fit.conditional_volatility)
plt.xlabel("time")
plt.title("Conditional Volatility")
plt.show()

Finally, let’s use the EGARCH(2,2) model to forecast the volatility of the MSFT stock price for the next day.

# Last 5 days of volatility
egarch22_fit.conditional_volatility[-5:]
Date
2023-05-08 00:00:00-04:00    0.017646
2023-05-09 00:00:00-04:00    0.016606
2023-05-10 00:00:00-04:00    0.015939
2023-05-11 00:00:00-04:00    0.016860
2023-05-12 00:00:00-04:00    0.016005
# Forecast next day
forecasts = egarch22_fit.forecast(reindex=False)
print("Forecasting Mean variance")
print(egarch22_forecast.mean.iloc[-3:])
print("Forecasting Residual variance")
print(forecasts.residual_variance.iloc[-3:])
Forecasting Mean variance
                                h.1
Date                               
2023-05-09 00:00:00-04:00  0.001541

Based on the model, the forecasted volatility for the next day is 0.001541. This value suggests that the average volatility will decrease compared to the last five days. However, the accuracy of this prediction remains uncertain. To assess the model’s accuracy, a rolling prediction approach can be used and compared against actual values using a measure like RMSE. Further analysis will be explored in the subsequent parts of this series.

References

2 ways to plot the confidence interval of a best fit regression line using R and python

When analyzing data, it is often useful to fit a regression line to model the relationship between two variables. However, it is also important to understand the uncertainty associated with the line of best fit. One way to display this uncertainty is by plotting the confidence interval about the regression line. In this document, we will discuss two methods for plotting the confidence interval about a best fit regression line using R and Python. Finally, we decide on when to use which one.

Method 1: Using R + ggplot2

R is a popular open-source programming language for statistical computing and graphics. To plot the confidence interval about a best fit regression line in R, we can use the ggplot2 package. Here are the steps to do so:

Load the necessary libraries:

library(ggplot2)

Generate some data

> a=c(1:10)
> b=5*a+5*rnorm(10)
> df=data.frame(a,b)
> df
    a         b
1   1  5.253065
2   2 18.189419
3   3 15.137868
4   4 20.399989
5   5 27.297348
6   6 27.935176
7   7 29.603539
8   8 34.692199
9   9 38.631428
10 10 57.167884

Create a scatter plot with ggplot() and specify the data and variables. The mapping is necessary to let ggplot know that we want to plot the column “a” along the x-axis and the column “b” along the y-axis.

ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18)

Add the regression line with geom_smooth(method="lm"):

ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm")

The confidence interval is automatically added. In case it isn’t add the following to the plot: ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm") + geom_ribbon(aes(ymin=ci[,2], ymax=ci[,3]), alpha=0.2) . The whole code looks like this:

ci=predict(lm.fit, newdata = df['a'], interval = "confidence")
fit        lwr      upr
1   7.348585  0.5342818 14.16289
2  11.811297  6.0319880 17.59061
3  16.274010 11.4134868 21.13453
4  20.736723 16.6005974 24.87285
5  25.199435 21.4780169 28.92085
6  29.662148 25.9407294 33.38357
7  34.124860 29.9887351 38.26099
8  38.587573 33.7270495 43.44810
9  43.050285 37.2709758 48.82959
10 47.512998 40.6986947 54.32730
ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm") + geom_ribbon(aes(ymin=ci[,2], ymax=ci[,3]), alpha=0.2)

ymin and ymax are the lower and upper bounds of the confidence interval. The alpha parameter adjusts the transparency of the ribbon.

Method 2: Python + seaborn

Python is another popular programming language for data analysis and visualization. To plot the confidence interval about a best fit regression line in Python, we can use the seaborn package. Here are the steps to do so:

Load the necessary libraries:

import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")

Generate data

a = np.arange(10)
b = 5*a + 5*np.random.rand(10)
df = pd.DataFrame({'a':a, 'b':b})

Create a scatter plot with sns.scatterplot() and specify the data and variables:

_ = sns.scatterplot(data=df, x="a", y="b")

Add the regression line with sns.regplot():

_ = sns.regplot(data=df, x="a", y="b")

Finally, add the confidence interval with sns.regplot(ci=95):

_ = sns.regplot(data=df, x="a", y="b", ci=95)

The ci parameter specifies the confidence interval level in percentage.

Verdict

We used the ggplot2 package in R and the seaborn package in Python to generate the confidence interval plots. The ggplot2 result definitely looks more professional quality while the seaborn was much faster to code. We can choose the method that fits our needs. If we want to publish our graphs in journals then ggplot2 might be a better choice (not always). If we want to do a quick presentation then I will prefer seaborn.

R Studio installation issue workaround for Ubuntu 22

If you are facing issues with installing R Studio on Ubuntu 22, and getting this error

Depends: libicu70 (>= 70.1-1~) but it is not installable

Here is a simple workaround that you can follow.

  1. First, follow the steps as outlined in the official website of R Studio

https://cran.rstudio.com/

Then download the ICU v70 from official Ubuntu website.

https://packages.ubuntu.com/jammy/libicu70

  1. Download the package and install using dpkg
sudo dpkg -i libicu70_70.1-2_amd64.deb
  1. After the installation is complete, run R Studio from the applications menu or by typing rstudio in the terminal.

This workaround should help you install R Studio on Ubuntu 22 without any issues. If you face any problems, feel free to consult the official R Studio documentation or reach out to the community for help.

The Role of p-Value in Hypothesis Testing and Machine Learning

Introduction

In statistical hypothesis testing, the p-value is a widely-used tool to evaluate the significance of results obtained from a sample as they relate to a population. In machine learning, hypothesis testing is employed to validate models and determine whether the results obtained from the models are statistically significant.

Example using the Normal Distribution

Suppose we have a dataset of the heights of 1000 people, and we want to test whether the average height of the population is greater than 170 cm. We can use the normal distribution to model the distribution of heights in the population.

We can use the null hypothesis that the average height is 170 cm, and the alternative hypothesis that the average height is greater than 170 cm.

We take a sample of 50 people from the population and calculate the sample mean and standard deviation. We then calculate the test statistic:

z = (sample_mean - hypothesized_mean) / (standard_error)

where the standard error is given by:

standard_error = standard_deviation / sqrt(sample_size)

If the null hypothesis is true, then the test statistic follows a standard normal distribution. We can calculate the p-value as the area under the tail of the distribution corresponding to the observed test statistic.

If the p-value is less than the significance level (usually 0.05), then we reject the null hypothesis and conclude that the average height is greater than 170 cm. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the alternative hypothesis.

Uses in Machine Learning

In machine learning, the p-value is used to evaluate the performance of a model. For instance, a machine learning algorithm could be used to predict whether a customer will buy a product or not. The p-value would then be used to determine whether the results obtained from the algorithm are statistically significant, i.e., whether the algorithm is better than random guessing.

The p-value is also used in feature selection, which is the process of selecting relevant features that contribute to the prediction of the target variable. Features with a low p-value are considered to be statistically significant and are usually selected for use in the model.

Example of p-value in ML

In R, the lm() function is used to fit linear regression models. After fitting a model using lm(), the summary() function can be used to obtain a summary of the model, which includes the p-values for the coefficients.

For example, suppose we have a dataset of the weights and heights of 100 people. We want to fit a linear regression model to predict weight from height. We can use the following R code to fit the model:

Suppose we have a dataset of the weights and heights of 100 people. We want to fit a linear regression model to predict weight from height. We can use the following R code to fit the model:

model <- lm(weight ~ height, data = mydata)
summary(model)

The output of the summary() function would be:

Call:
lm(formula = weight ~ height, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-6.2500 -1.4375  0.2812  1.6250  5.8125

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -103.285      5.181 -19.924   <2e-16 ***
height         3.451      0.079  43.545   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.424 on 98 degrees of freedom
Multiple R-squared:  0.8676,	Adjusted R-squared:  0.8663
F-statistic:  1895 on 1 and 98 DF,  p-value: < 2.2e-16

The table of coefficients includes the estimated coefficients for the intercept and the height variable, as well as their standard errors, t-values, and p-values. The p-value for the height variable is less than 0.05, indicating that the height variable is statistically significant and contributes to the prediction of the target variable (weight).

Limitations of p-value

It is important to note that the p-value is not without its limitations. One of the main limitations is that it can be influenced by the sample size and the significance level chosen by the researcher. Additionally, the p-value does not provide information about the effect size or the practical significance of the results obtained from a study. Therefore, it is important to interpret the p-value in conjunction with other statistical measures, such as effect size and confidence intervals, to obtain a more complete understanding of the results obtained from a study.

Once common issue is that we need to know which tailed test we use based on the context. The lm() function always outputs the two-tailed p-value (t-value). However, some cases as described below is really one-tailed.

For example, let’s say we are interested in studying the effect of a new drug on reducing anxiety levels in patients. We have a hypothesis that the new drug will decrease anxiety levels in patients, and we are not interested in the possibility that the drug may increase anxiety levels. In this case, we would use a one-tailed test with the alternative hypothesis stating that the true mean difference in anxiety levels between the treatment and control groups is less than zero (i.e., the drug reduces anxiety levels).

mydata <- data.frame(anxiety_reduction = c(2, 4, 3, 5, 1, 6, 7, 
                        8, 9, 10, 12, 13, 14, 15, 11),
                      new_drug = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0))

model <- lm(anxiety_reduction ~ new_drug, data = mydata)
summary(model)

Call:
lm(formula = anxiety_reduction ~ new_drug, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max
-2.041667 -0.916667 -0.041667  0.958333  2.458333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5000     0.8021   3.114    0.009 **
new_drug      1.4167     0.8021   1.766    0.099 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.386 on 13 degrees of freedom
Multiple R-squared:  0.1792,	Adjusted R-squared:  0.09458
F-statistic: 3.131 on 1 and 13 DF,  p-value: 0.09852

It looks like the above is not statistically significant since p-value>0.05. However, the test is left-tailed since we want to know if there is any decrease in symptoms. The p-value in the output is for two-tailed. Therefore, the left-tailed p-value would be half of that: 0.09852/2 = 0.049 < 0.05, making the effect of the drug statistically significant.

Conclusion

In conclusion, the p-value is a valuable tool in hypothesis testing and machine learning. It is used to evaluate the statistical significance of results obtained from models and experiments. However, it is important to interpret the p-value in conjunction with other statistical measures and to be aware of its limitations to obtain a complete understanding of the results obtained from a study.

Application of Linear Regression with R – Predicting Final Grades based on GPA

Application of Linear Regression with R

Predicting Final Grades based on GPA

In the problem below, our objective is to identify if there exists a cutoff GPA such that if a student’s’ GPA is below that threshold they tend to fail the class. This allows to identify at-risk students early in the course. Since the GPA is collected at the beginning of the course and the final grades are obtained at the end of the course, it implies that the GPA should be the predictor variable even though we want to identify a cutoff GPA. Therefore, the model we will fit is

final grade ~ GPA

Let us assume that the final grade is linearly related to the students’ GPA. We can check for linearity in a later section. 

> df = read.csv("gpa.csv")

The imported data looks like this. Names are withheld.

Student CurrentGPA TotalAbsent NumericGrade
***ia 3 15.04 89
***lheid 3.67 21.52 83
***ue 1.2 31.63 17
***tha 2.5 24.56 71
***raina 2 54.04 15
***ia 2.8 13.93 72
***issy 3.33 24.33 79
***ob 2.78 8.15 80
***uel 2.67 10.56 73
***hael 3.1 8.26 79

If we plot the quantile values 

> boxplot(df$NumericGrade)

it would look like this

a1

and if we plot GPA vs FinalGrade it will look like this

a2

We fit the linear model

> gpafit = lm(NumericGrade ~ CurrentGPA, data=df)

Ideally, we should do a test-train split and fit the training data and repeat till we reduce the bias. However, we are going for efficiency her and not accuracy. We plot the fitted line (red line) on the scatter plot of GPA vs FinalGrade.

> abline(gpafit, col="red")

a4

We want to set a cutoff for the passing grade of 70 and see what is the cutoff GPA for getting 70 and above. Now, we have obtained FinalGrade as a function of GPA and not the other way around.  To predict the GPA (x-value) given FinalGrade (y-value) we need to use the approx() function.

> approx(gpafit$fitted.values, df$CurrentGPA, 70)
$x
[1] 70

$y
[1] 2.680097

From the output we see that when x=70, y=2.68 where x is now the FinalGrade and y is the GPA. approx() takes in a list of x-values and the corresponding y-values as tries a linear interpolation of x_pred to get y_pred = approx(x,y,x_pred). We draw two lines, one at FinalGrade = 70 and one at GPA=2.68 to show our cutoffs.

> plot(df$CurrentGPA, df$NumericGrade, col="blue", 
      xlab = "GPA", ylab="FinalGrade")
> abline(gpafit, col="red")
> abline(v=2.68)
> abline(h=70)

a5

It looks like any student with a GPA>2.68 will tend to pass the course. There are some students who passed the course in spite of having GPA<2.68. But all students with GPA>2.68 passed except one. We can even identify the students by doing a python -like slicing of the dataframe.

> df[df$NumericGrade<70 & df$CurrentGPA>2.68,]
   Student CurrentGPA TotalAbsent NumericGrade
59 ***iola          3       28.04           46

We will run the regression analysis with all the predictor variables including student absenteeism in a later post.

Multiple linear regression with R

We demonstrate multiple linear regression (MLR) with R using a case study. Here we perform an “Analysis of grades for an algebra class”. In this case study, we use multiple analytical tools including multiple linear regression, model selection with p-values, and model verification with a test-train split of data.

Analysis of grades for an algebra class

Objective

Identify predictors of attrition and evaluate methods to measure the difficulty levels of exams

Measuring exam difficulty by comparing the medians

If the student competence remains constant then exams with similar difficulty levels should have the same median score. The spread of the data, or the standard deviation, however, implies a degree of preparedness or level of understanding among the students. So the standard deviation should decrease over time as the course progresses. We plot the data for MH140 Fall 2019 class and observe the trend that is predicted by logic. The data demonstrate that the tests are of the same difficulty level.

This figure demonstrates that ideally, the median score will remain the same across exams while the standard deviation will decrease over time.

R code

The preparation of data requires some effort as the LMS(Learning Management Software) exports all data. We have to select only the relevant columns of data, which are the weighted totals of the grading elements. After extracting only the grading element totals, we compile them into an excel spreadsheet and export it to CSV format, which is then imported into R using the following code.

> df = read.csv("fall2019.csv")


Then a boxplot of quiztotal, midterm, final, and Grade is plotted to observe the medians and the standard deviation of the score.

> boxplot(df$quiztotal, df$midterm, df$final, df$Grade, 
+ names = c("QuizTotal", "Midterm", "Final", "OverallGrade" ) , 
+ col ="deepskyblue2", ylab="Grades (%)", 
+ main = "Comparing the medians of grading elements")

names – labels each boxplot. The names have to be provided as a vector with c(“label1”, “label2”, ..)
ylab      – labels y-axis
main     – boxplot title
col        – color 

Identifying predictors of attrition

We analyze the gradebook data to identify which grading elements can be the best predictors of failure rates in a class. For this, we clean the data and prepare it for analysis for statistical software. Then we performed multiple linear regression on the data using the variable Grade (which is the final overall grade that the student receives in the class) as the response variable. The regression is performed against the following variables. 

HWtotal  q1 q2 q3  q4 q5 q6 q7  q8 q9 q10 q11 quiztotal midterm final 

Results of fitting linear regression models

We can fit against all predictor variables by manually typing in the predictor variables, like this.

> lm(Grade ~ HWtotal + quiztotal + midterm + final, data=df)

However, instead of fitting the models manually we will first iterate over all possible models.

> lapply(colnames(df)[2:17], 
+ function(x) summary(lm(paste("Grade", " ~ ", x),
+ data = df))$coefficients)

The lapply() function extracts the column names, which are the predictor variables, and iteratively sends them to the function fittingly called function(x). Inside the function, function(x), the paste function generates the linear model (e.g. Grade ~ midterm) by pasting the predictor variables passed to it by the lapply() function. The lm() function then fits the model, and then the summary() extracts the coefficients. Here are the results for all possible linear models with one predictor variable. There are two methods of model selection – the p-value method and the adjusted R-squared method. We are going to use the p-value method since it is simpler to implement.

Estimate p-values
(Intercept) 22.2454211 0.39753564
HWtotal 0.6031689 0.03585739
(Intercept) 74.23412354 9.94E-13
q1 0.09184575 2.06E-01
(Intercept) 68.6311475 1.70E-08
q2 0.1357377 1.40E-01
(Intercept) 70.0885609 3.71E-12
q3 0.1258918 4.05E-02
(Intercept) 69.6827466 1.20E-17
q4 0.1520878 1.17E-04
(Intercept) 75.57848273 8.69E-18
q5 0.08175822 4.98E-02
(Intercept) 77.30462776 1.74E-15
q6 0.04201964 3.89E-01
(Intercept) 61.8194946 6.49E-11
q7 0.2545848 1.08E-03
(Intercept) 75.08851214 1.36E-11
q8 0.06235148 3.57E-01
(Intercept) 66.2838735 1.03E-09
q9 0.1817256 3.35E-02
(Intercept) 74.42307692 1.57E-14
q10 0.07730769 1.20E-01
(Intercept) 78.34591195 3.37E-18
q11 0.03647799 4.06E-01
(Intercept) 47.6436925 1.33E-08
quiztotal 0.4686029 3.42E-06
(Intercept) 49.0215606 9.64E-08
midterm 0.3968172 4.26E-05
(Intercept) 45.6114161 0.0004473079
final 0.4469315 0.0045930408

From the results, it is clear the quiz4, midterm, and quiztotal are the best predictors of the final grades. 

Interpretation of results

The following regression models were chosen for their low p-values. 

Model p-values
Grade ~ midterm 4.26E-05
Grade ~ q4 1.17E-04
Grade ~ HWtotal 0.03586
Grade ~ HWtotal + quiztotal + midterm + final 2.20E-16

Quiztotal is replaced with q4 instead, although quiztotal with the lowest p-value has the highest predictive power. The logic is that, since quiztotal is an aggregate of 11 quizzes, regression against it is really a regression with 11 variables with 10 constraints (the relative coefficients are fixed). We see that the model Grade ~ HWtotal + quiztotal + midterm + final has the best predictive power. However, we only have access to the final grades at the very end of the semester so it is of little value at the beginning of the semester when predicting student success can only be based on a few weeks of data.

The p-values signify that the homework is the worst predictor of the final overall grade while the q4 is the best predictor. This is known to the instructor for some time from experience. The data confirms it statistically. Since the students can use external assistance for their homework assignments, the variable HWtotal has no impact on the final grades. Quiz 4 consists of solving linear equations. Students who are unable to learn the simple techniques of solving a linear equation, probably do not have the skills or aptitude to learn the more advanced topics in the class. This implies that the student failure rate can be accurately predicted by week 5.

Let us put that theory to test using a test-train stress test to the model. Since the dataset size is so small it is necessary to use bootstrap methods. However, for simplicity, we are going, to begin with, a simple test-train split of the data. In a test-train split, the data is randomly sorted into a training set and a test set according to a predetermined ratio called the test-train split ratio. The models are then trained on the training set and then tested on the test set to see which model provides the best fit. We will be fitting linear models only at the beginning, so bias-variance tradeoff does not factor in yet. Later, however, we will generalize to higher dimensional models and explore the bias-variance tradeoff.

Linear regression using test-train split

We randomly select 80% of the data into a training sample and the remaining 20% is reserved as test data. We train our models on the training samples (which were randomly selected) using the model Grade ~ Quiz4grades which is a linear regression model. 

> gradefit = lm(Grade ~ q4, data = df, subset = trainset)
Estimate Std. Error t -value p-values
(Intercept) 70.5059 2.59796 27.139 3.64E-14
q4 0.15557 0.03332 4.669 0.000303

Test the model on the test data set

> predict(gradefit,df)[-trainset]
Student # 3 6 7 17 20
Predicted Grade 86.06334 80.92938 75.63985 75.63985 86.06334
Actual Grade 81 68 74 74 84
Error -5.06334 -12.92938 -1.63985 -1.63985 -2.06334

The MSE (mean squared error) on the test data set is MSE ~ 40. The errors are distributed according to the t-distribution. Therefore, the confidence interval for the errors at the 95% CI can be calculated using the t-distribution and they are (-3.4, +1.3).

> mean(((df$Grade - predict(gradefit,df))[-trainset])^2)
40.48838
> errs = (df$Grade - predict(gradefit,df))
> mean(errs)
-1.060716
> sd(errs)
5.19508
> mean(errs) + qt(0.975, df=16)*sd(errs)/sqrt(22)
1.28728
> mean(errs) - qt(0.975, df=16)*sd(errs)/sqrt(22)
-3.408713


This means that every prediction the model makes is accurate within -3.4 and 1.3 of the actual grade. The model can be written mathematically as follows.

Grade = 0.156*Quiz4grade + 70.5

This is remarkable since 70 is the passing grade. We see that Quiz 4 performance directly impacts the pass-fail rate. What we have really accomplished is that we have identified the predictor of the attrition rate for this class.

Observation

The models trained on available data works really well in the same class of students. This means if the first 4 weeks of data are available on a group of students, their future performance can be predicted with 95% accuracy within 3.4 points of the actual final grade.

Multiple Linear Regression (MLR)

Read data

> cogn = read.csv("http://bit.ly/dasi_cognitive")
> head(cogn)

kid_score mom_hs mom_iq mom_work mom_age
65 yes 121.11753 yes 27
98 yes 89.36188 yes 25
85 yes 115.44316 yes 27
83 yes 99.44964 yes 25
115 yes 92.74571 yes 27
98 no 107.90184 no 18

 

 

Analysis of data

Here we are trying to predict kid’s test scores using their mother’s IQ, high school degree, work status, and age. Only a few of there predictor variables have a substantial impact on the kid_scores. As we shall see soon that we can improve the fit (by reducing the Adjusted R-squared) by eliminating a few of these variables. To understand the baseline result we begin by testing against the full model.

> cgn.fit = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age ,
 data = cogn)
> summary(cgn.fit)
Call:
lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, 
    data = cogn)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.045 -12.918   1.992  11.563  49.267 

Coefficients:
Estimate Std. Error t -value Pr(>|t|)
(Intercept) 19.59241 9.21906 2.125 0.0341
mom_hsyes 5.09482 2.3145 2.201 0.0282
mom_iq 0.56147 0.06064 9.259 <2e-16
mom_workyes 2.53718 2.35067 1.079 0.281
mom_age 0.21802 0.33074 0.659 0.5101
Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared:  0.2171,	Adjusted R-squared:  0.2098 
F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16

We can see that the variables “mom_workyes” and “mom_age” have high p-values.

We start by fitting simple linear regression models with only one predictor variable. First, create a list of the predictor variables to iterate over.

> cols = colnames(cogn)[!(colnames(cogn) %in% c("kid_score"))]
> cols
[1] "mom_hs"   "mom_iq"   "mom_work" "mom_age" 

Fitting kids_score against each predictor variable in the list (“mom_hs” “mom_iq” “mom_work” “mom_age” ) we get the following adjusted R-squared values.

> for (c in cols){
+ adjr = summary(lm(paste("kid_score", "~", c), data=cogn))$adj.r.square
+ print(c(c,adjr))
+ }
[1] "mom_hs"             "0.0539445105919029"
[1] "mom_iq"            "0.199101580842152"
[1] "mom_work"            "0.00965521339400432"
[1] "mom_age"             "0.00616844313235732"

 

The adjusted R-square values demonstrate that the mother’s IQ would be the best predictor of high school scores.

Fitting all possible combinations is a lot of work (See [1]). We would rather use Python to perform those tasks. I would write a separate blog post to perform the same analysis using python.

We can, however, analyze a few of the models manually. We can perform MLR on models by removing one predictor variable at a time [2].

 

References:

  1. ryouready.wordpress.com/2009/02/06/r-calculating-all-possible-linear-regression-models-for-a-given-set-of-predictors
  2. www.coursera.org/learn/linear-regression-model

 

 

Barplots in R

Basic barplot

To draw the basic barplots we use the in-built data table mtcars.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If we wish to plot the frequency of gears found in the database we first need to create a frequency table by aggregating all the unique values of the “gear” column. table() helps aggregate unique values and their frequencies.

> tmp=table(mtcars$gear)
3 4 5
15 12 5
> barplot(tmp)

Grouped barplot

The grouped barplot takes more effort.

> tmp=table(mtcars$cyl, mtcars$gear)

3 4 5
4 1 8 2
6 2 4 1
8 12 0 2

> barplot(tmp,beside = T, legend = paste(rownames(tmp),"cylinders"), 
  col = c(1:length(rownames(tmp))), xlab = "Gears/cylinders", 
  ylab = "Count")

rownames(tmp) extracts unique values of the column “cyl”(cylinders). col cycles through colors. Legend takes the rownames(tmp) as an argument but we want to append the string cylinders so that it is clear what the numbers signify. That is achieved by the paste() command which “pastes” the string “cylinders” at the end.

References:
https://www.statmethods.net/graphs/bar.html

Plotting multiple graphs in R

There are two ways to plotting multiple graphs in R.

  1. Create a grid of separate graphs on the same plot
  2. Overlay all graphs in a single frame of one plot

Create a grid of separate graphs on the same plot

A grid in R is created using the par() command. An example where we plot 4 graphs in a single plot.

> x=seq(-4,4,.1)
> par(mfrow=c(2,2))
> plot(x,dt(x,3),'l',col='red')
> plot(x,dt(x,5),'l',col='blue')
> plot(x,dt(x,10),'l',col='black')
> plot(x,dnorm(x),'l',col='magenta')

We generate a vector of values between (-4,+4) called ‘x’ and then feed that into the point distribution function (pdf) of Students t-distribution – df(z,df) – which accepts the z-score and the degrees of freedom (df) to generate the pdf. We plot 3 graphs with df = 3, 5, 10. And then we plot a pdf of the Normal Distribution – pnorm(z). Students t-distribution makes more sense with smaller sample sizes and probably with stock prices which tend to have extreme events.

The 2×2 plots – 4 graphs in 2 rows is made possible by the command par(mfrow=c(2,2)). For a 2×1 matrix of plots, we would have used par(mfrow=(c2,1)).

Rplot2

The graphs all quite different since the Students t-distribution has fatter tails than the Normal Distribution. However, that is not very apparent from the side-by-side graphs. We will need to overlay the graphs on top of each other to see the tiny differences in the graphs.

To return to regular single plot grid use

> par(mfrow=c(1,1))

Overlay all graphs in a single frame of one plot

To overlay multiple plots on each other we start by plotting the first graph with all its axes and labels. The only restriction is that we will need to limit the y-interval between specific values (if x is the independent variable, if y is the independent variable then will need to restrict x-interval). ylim=c(0,0.4) restricts the plot between the y-interval (0,0.4). 

> par(mfrow=c(1,1))
> plot(x,dnorm(x),'l',col=1,ylim=c(0,0.4))

Rplot1

Now, we overlay the rest of the plots over this one by looping over pdf of the Students t-distribution with df = 3,5,10 (which we plotted in the previous section).

> for (val in c(3,5,10)){
+     par(new=T)
+     plot(x,dt(x,val),'l',col=val,axes=F,xlab="",ylab="")
+ }

The overlay is achieved using the par(new=T) command (same as par(new=TRUE)). It tells R that we are plotting a new graph on the same space. We also needed to turn off the axes (axes=F) and the labels for the x-axes and the y-axes (xlab=””,ylab=””). Otherwise, the text of different graphs will overlay on each other and make it look messy. In the resulting graph, we can see the fat tails of the Students t-distribution.Rplot4

Convert all data in a table to a numeric data type in R

When tables are imported from CSV files they are usually imported as data.frames and the values are characters and not numeric. Hence it is impossible to use the table unless the data is converted to a numeric data type.

The conversion of a single variable to the numeric data type in R involves passing the variable to the function as.numeric(). 

var_in_numeric_dtype = as.numeric(var_in_char_dtype)

 

For tables (matrices), the as.numeric() function has to be recursively applied using the apply() function (I seem to have more control with apply() than sapply()).

> alg2 = apply(alg[,c(1:4)],c(1,2),as.numeric)
            Mean grade   Std Total students # of fails
X2016S1          77.00 14.00              5          1
X2016S2          74.00 14.00             11          3
X2016S3          85.00 12.00             20          1
X2017S1          72.00 21.00             22          5
X2017S2          57.45 38.28              7          3
X2018S1          73.91 21.56             17          3
X2018S2          83.20  6.62              9          0
X2018S3          69.98 22.44             14          4
Spring.2019      69.63 28.62             19          2


alg2 is a matrix consisting of numerical values on which we can perform mathematical operations. We can recalculate the column “fail rate” (which we had to drop as it was not numeric) now using columns 3 ( “# of fails”) and  4 (“Total students”).

> failrate = alg2[,4]/alg2[,3]*100
    X2016S1     X2016S2     X2016S3     X2017S1     X2017S2     X2018S1     X2018S2     X2018S3 Spring.2019 
   20.00000    27.27273     5.00000    22.72727    42.85714    17.64706     0.00000    28.57143    10.52632 


> alg = cbind(alg2, failrate)
            Mean grade   Std Total students # of fails failrate
X2016S1          77.00 14.00              5          1 20.00000
X2016S2          74.00 14.00             11          3 27.27273
X2016S3          85.00 12.00             20          1  5.00000
X2017S1          72.00 21.00             22          5 22.72727
X2017S2          57.45 38.28              7          3 42.85714
X2018S1          73.91 21.56             17          3 17.64706
X2018S2          83.20  6.62              9          0  0.00000
X2018S3          69.98 22.44             14          4 28.57143
Spring.2019      69.63 28.62             19          2 10.52632