Heteroscedasticity Unmasked: Taming Increasing Variance with Transformations!

  1. What is Heteroscedasticity? Heteroscedasticity refers to an uneven spread in the data: the variability of the observations is not constant across the range of the predictor.
  2. Why is it Bad for Modeling and Prediction? Heteroscedasticity can wreak havoc on modeling and prediction because it violates the constant-variance assumption behind many statistical techniques, leading to misleading standard errors and unreliable inference.
  3. How to Handle It? One effective approach is data transformation: altering the data with a mathematical function to stabilize the variance and make your models more robust.

The Data: Let’s start with some code and synthetic data for house prices, where the variability increases as house sizes grow. This mirrors real-world scenarios where larger properties often exhibit more price variability.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial import Polynomial

# Step 1: Generate Synthetic Data with Heteroscedasticity
np.random.seed(0)  # For reproducibility
house_sizes = np.linspace(1000, 5000, 100)  # House sizes
true_prices = 50000 + 30 * house_sizes  # True house prices

# Introduce heteroscedasticity - the noise standard deviation grows with house size
error_term = np.random.normal(0, 7 * house_sizes)  # one draw per house, scale 7x its size
noisy_prices = true_prices + error_term

# Split error_term into positive and negative components
positive_errors = np.where(error_term > 0, error_term, 0)
negative_errors = np.where(error_term < 0, error_term, 0)

Step 2: Now, let’s plot the bounding lines, illustrating the increasing variance.

# Fit straight lines (degree-1 polynomials) to the positive and negative errors
positive_poly = Polynomial.fit(house_sizes, positive_errors, deg=1)
negative_poly = Polynomial.fit(house_sizes, negative_errors, deg=1)

# Evaluate the bounding lines and draw them over the noisy data
positive_bounds = positive_poly(house_sizes)
negative_bounds = negative_poly(house_sizes)

plt.scatter(house_sizes, noisy_prices, s=10, color="steelblue")
plt.plot(house_sizes, true_prices + positive_bounds, "r--")
plt.plot(house_sizes, true_prices + negative_bounds, "r--")
plt.xlabel("House size (sq ft)")
plt.ylabel("Price")
plt.show()

Step 3: To stabilize the variance, we apply a logarithmic transformation to the noisy prices. Because the noise scale here grows roughly in proportion to the price level, taking logs compresses large values more than small ones and makes the data more suitable for analysis.
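Why the log works: a sketch of the standard delta-method argument, assuming the standard deviation of the prices is roughly proportional to their mean, sd(Y) ≈ c * E[Y]:

Var(log Y) ≈ Var(Y) / (E[Y])^2 = (c * E[Y])^2 / (E[Y])^2 = c^2

so the log-prices have an approximately constant variance regardless of house size.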

# Step 3: Apply a Logarithmic Transformation
data = pd.DataFrame({'Size': house_sizes, 'Price': noisy_prices})  # collect into a frame
data['Log_Price'] = np.log(data['Price'])

Result: The transformation effectively flattens the increasing variance. Comparison plots of Price and Log_Price against house size visualize the change and show the data’s improved stability.

Conclusion: Addressing increasing variance with transformations is a valuable tool in data analysis. Whether you’re dealing with house prices or any other dataset, understanding and mitigating heteroscedasticity enhances the reliability of your analysis and decision-making.

The Role of p-Value in Hypothesis Testing and Machine Learning

Introduction

In statistical hypothesis testing, the p-value is a widely used tool for evaluating the significance of results obtained from a sample as they relate to a population. In machine learning, hypothesis testing is employed to validate models and determine whether the results obtained from them are statistically significant.

Example using the Normal Distribution

Suppose we have a dataset of the heights of 1000 people, and we want to test whether the average height of the population is greater than 170 cm. We can use the normal distribution to model the distribution of heights in the population.

We can use the null hypothesis that the average height is 170 cm, and the alternative hypothesis that the average height is greater than 170 cm.

We take a sample of 50 people from the population and calculate the sample mean and standard deviation. We then calculate the test statistic:

z = (sample_mean - hypothesized_mean) / (standard_error)

where the standard error is given by:

standard_error = standard_deviation / sqrt(sample_size)

If the null hypothesis is true, then the test statistic follows a standard normal distribution. We can calculate the p-value as the area under the tail of the distribution beyond the observed test statistic; for this right-tailed test, that is the area to the right of z.

If the p-value is less than the significance level (usually 0.05), then we reject the null hypothesis and conclude that the average height is greater than 170 cm. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the alternative hypothesis.
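To make this concrete, here is a quick sketch in R; the sample mean of 172.5 cm and standard deviation of 8 cm are made-up numbers for illustration:

sample_mean <- 172.5   # hypothetical sample statistics
sample_sd   <- 8
n           <- 50

standard_error <- sample_sd / sqrt(n)
z <- (sample_mean - 170) / standard_error   # test statistic
pnorm(z, lower.tail = FALSE)                # right-tail p-value, about 0.014

Since 0.014 < 0.05, we would reject the null hypothesis in this illustrative case.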

Uses in Machine Learning

In machine learning, the p-value is used to evaluate the performance of a model. For instance, a machine learning algorithm could be used to predict whether a customer will buy a product or not. The p-value would then be used to determine whether the results obtained from the algorithm are statistically significant, i.e., whether the algorithm is better than random guessing.
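For example (a sketch with made-up numbers), if a classifier labels 58 out of 100 balanced test cases correctly, a one-sided binomial test checks whether this beats the 50% accuracy of random guessing:

binom.test(58, 100, p = 0.5, alternative = "greater")   # p-value is about 0.067

Here we could not conclude, at the 5% level, that the classifier beats random guessing.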

The p-value is also used in feature selection, which is the process of selecting relevant features that contribute to the prediction of the target variable. Features with a low p-value are considered to be statistically significant and are usually selected for use in the model.

Example of p-value in ML

In R, the lm() function is used to fit linear regression models. After fitting a model using lm(), the summary() function can be used to obtain a summary of the model, which includes the p-values for the coefficients.

For example, suppose we have a dataset of the weights and heights of 100 people, and we want to fit a linear regression model to predict weight from height. We can use the following R code to fit the model:

model <- lm(weight ~ height, data = mydata)
summary(model)

The output of the summary() function would be:

Call:
lm(formula = weight ~ height, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-6.2500 -1.4375  0.2812  1.6250  5.8125

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -103.285      5.181 -19.924   <2e-16 ***
height         3.451      0.079  43.545   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.424 on 98 degrees of freedom
Multiple R-squared:  0.8676,	Adjusted R-squared:  0.8663
F-statistic:  1895 on 1 and 98 DF,  p-value: < 2.2e-16

The table of coefficients includes the estimated coefficients for the intercept and the height variable, as well as their standard errors, t-values, and p-values. The p-value for the height variable is less than 0.05, indicating that the height variable is statistically significant and contributes to the prediction of the target variable (weight).

Limitations of p-value

It is important to note that the p-value is not without its limitations. One of the main limitations is that it can be influenced by the sample size and the significance level chosen by the researcher. Additionally, the p-value does not provide information about the effect size or the practical significance of the results obtained from a study. Therefore, it is important to interpret the p-value in conjunction with other statistical measures, such as effect size and confidence intervals, to obtain a more complete understanding of the results obtained from a study.
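For example, R’s confint() reports a confidence interval for each coefficient of the model fitted above, which complements the p-value with a range of plausible effect sizes:

confint(model, level = 0.95)   # 95% confidence intervals for the coefficients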

One common issue is that the appropriate tail of the test depends on the context. The lm() function always reports two-tailed p-values for its t-statistics; however, some cases, like the one described below, are really one-tailed.

For example, let’s say we are interested in studying the effect of a new drug on reducing anxiety levels in patients. We have a hypothesis that the new drug will decrease anxiety levels in patients, and we are not interested in the possibility that the drug may increase anxiety levels. In this case, we would use a one-tailed test with the alternative hypothesis stating that the true mean difference in anxiety levels between the treatment and control groups is less than zero (i.e., the drug reduces anxiety levels).

mydata <- data.frame(anxiety_reduction = c(2, 4, 3, 5, 1, 6, 7, 
                        8, 9, 10, 12, 13, 14, 15, 11),
                      new_drug = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0))

model <- lm(anxiety_reduction ~ new_drug, data = mydata)
summary(model)

Call:
lm(formula = anxiety_reduction ~ new_drug, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max
-2.041667 -0.916667 -0.041667  0.958333  2.458333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5000     0.8021   3.114    0.009 **
new_drug      1.4167     0.8021   1.766    0.099 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.386 on 13 degrees of freedom
Multiple R-squared:  0.1792,	Adjusted R-squared:  0.09458
F-statistic: 3.131 on 1 and 13 DF,  p-value: 0.09852

It looks like the effect is not statistically significant, since the p-value > 0.05. However, our test is one-tailed: we only want to know whether the drug reduces symptoms. The p-value in the output is two-tailed, and because the estimated coefficient is in the hypothesized direction, the one-tailed p-value is half of it: 0.09852 / 2 ≈ 0.049 < 0.05, making the effect of the drug statistically significant.
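The halving can be done directly from the fitted model; a small sketch:

# summary() stores the two-sided p-value for each coefficient; when the
# estimate lies in the hypothesized direction, the one-sided p-value is half of it
p_two_sided <- summary(model)$coefficients["new_drug", "Pr(>|t|)"]
p_two_sided / 2   # about 0.049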

Conclusion

In conclusion, the p-value is a valuable tool in hypothesis testing and machine learning. It is used to evaluate the statistical significance of results obtained from models and experiments. However, it is important to interpret the p-value in conjunction with other statistical measures and to be aware of its limitations to obtain a complete understanding of the results obtained from a study.

Application of Linear Regression with R – Predicting Final Grades based on GPA

In the problem below, our objective is to identify whether there exists a cutoff GPA such that students whose GPA falls below that threshold tend to fail the class. This allows us to identify at-risk students early in the course. Since the GPA is collected at the beginning of the course and the final grades are obtained at the end, the GPA should be the predictor variable even though we want to identify a cutoff GPA. Therefore, the model we will fit is

final grade ~ GPA

Let us assume that the final grade is linearly related to the students’ GPA. We can check for linearity in a later section. 

> df = read.csv("gpa.csv")

The imported data looks like this. Names are withheld.

Student    CurrentGPA  TotalAbsent  NumericGrade
***ia            3           15.04            89
***lheid         3.67        21.52            83
***ue            1.2         31.63            17
***tha           2.5         24.56            71
***raina         2           54.04            15
***ia            2.8         13.93            72
***issy          3.33        24.33            79
***ob            2.78         8.15            80
***uel           2.67        10.56            73
***hael          3.1          8.26            79

If we plot the quantile values 

> boxplot(df$NumericGrade)

it would look like this

[Figure a1: boxplot of NumericGrade]

and if we plot GPA vs FinalGrade it will look like this

[Figure a2: scatter plot of GPA vs FinalGrade]

We fit the linear model

> gpafit = lm(NumericGrade ~ CurrentGPA, data=df)

Ideally, we should do a test-train split, fit the training data, and repeat until we reduce the bias. However, we are going for efficiency here and not accuracy. We plot the fitted line (red line) on the scatter plot of GPA vs FinalGrade.

> abline(gpafit, col="red")

[Figure a4: GPA vs FinalGrade with the fitted regression line in red]

We want to set the passing grade cutoff at 70 and find the corresponding cutoff GPA for scoring 70 and above. Note that we have modeled FinalGrade as a function of GPA and not the other way around, so to predict the GPA (x-value) given a FinalGrade (y-value) we can use the approx() function.

> approx(gpafit$fitted.values, df$CurrentGPA, 70)
$x
[1] 70

$y
[1] 2.680097

From the output we see that when x = 70, y ≈ 2.68, where x is now the FinalGrade and y is the GPA. approx() takes a list of x-values and the corresponding y-values and performs a linear interpolation at x_pred to get y_pred = approx(x, y, x_pred).
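As a sanity check, the cutoff can also be recovered algebraically from the fitted coefficients; this quick sketch agrees with the interpolation:

> coefs = coef(gpafit)
> (70 - coefs[1]) / coefs[2]   # GPA at which the fitted line crosses a grade of 70

We draw two lines, one at FinalGrade = 70 and one at GPA = 2.68, to show our cutoffs.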

> plot(df$CurrentGPA, df$NumericGrade, col="blue", 
      xlab = "GPA", ylab="FinalGrade")
> abline(gpafit, col="red")
> abline(v=2.68)
> abline(h=70)

[Figure a5: GPA vs FinalGrade with the fitted line and cutoff lines at FinalGrade = 70 and GPA = 2.68]

It looks like students with a GPA > 2.68 tend to pass the course: all of them passed except one, while a few students with a GPA < 2.68 passed anyway. We can even identify the exception by doing a Python-like slicing of the dataframe.

> df[df$NumericGrade<70 & df$CurrentGPA>2.68,]
   Student CurrentGPA TotalAbsent NumericGrade
59 ***iola          3       28.04           46

We will run the regression analysis with all the predictor variables including student absenteeism in a later post.

Multiple linear regression with R

We demonstrate multiple linear regression (MLR) with R using a case study. Here we perform an “Analysis of grades for an algebra class”. In this case study, we use multiple analytical tools including multiple linear regression, model selection with p-values, and model verification with a test-train split of data.

Analysis of grades for an algebra class

Objective

Identify predictors of attrition and evaluate methods to measure the difficulty levels of exams

Measuring exam difficulty by comparing the medians

If student competence remains constant, then exams of similar difficulty should have the same median score. The spread of the data, i.e. the standard deviation, reflects the degree of preparedness or level of understanding among the students, so the standard deviation should decrease as the course progresses. We plot the data for the MH140 Fall 2019 class and observe exactly the trend predicted by this logic. The data suggest that the tests are of the same difficulty level.

This figure demonstrates that ideally, the median score will remain the same across exams while the standard deviation will decrease over time.

R code

The preparation of data requires some effort as the LMS (Learning Management System) exports all of its data. We have to select only the relevant columns, which are the weighted totals of the grading elements. After extracting the grading element totals, we compile them into an Excel spreadsheet and export it to CSV format, which is then imported into R using the following code.

> df = read.csv("fall2019.csv")


Then a boxplot of quiztotal, midterm, final, and Grade is plotted to observe the medians and the standard deviation of the score.

> boxplot(df$quiztotal, df$midterm, df$final, df$Grade, 
+ names = c("QuizTotal", "Midterm", "Final", "OverallGrade" ) , 
+ col ="deepskyblue2", ylab="Grades (%)", 
+ main = "Comparing the medians of grading elements")

names – labels each boxplot; the names are provided as a vector, c("label1", "label2", ...)
ylab  – labels the y-axis
main  – boxplot title
col   – fill color

Identifying predictors of attrition

We analyze the gradebook data to identify which grading elements are the best predictors of failure rates in a class. For this, we clean the data and prepare it for the statistical software. Then we perform multiple linear regression on the data using the variable Grade (the final overall grade that the student receives in the class) as the response variable. The regression is performed against the following variables.

HWtotal q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 quiztotal midterm final

Results of fitting linear regression models

We can fit against all predictor variables by manually typing in the predictor variables, like this.

> lm(Grade ~ HWtotal + quiztotal + midterm + final, data=df)

However, instead of typing each model manually, we will first iterate over all possible one-predictor models.

> lapply(colnames(df)[2:17], 
+ function(x) summary(lm(paste("Grade", " ~ ", x),
+ data = df))$coefficients)

The lapply() function extracts the column names, which are the predictor variables, and iteratively sends them to the function fittingly called function(x). Inside the function, function(x), the paste function generates the linear model (e.g. Grade ~ midterm) by pasting the predictor variables passed to it by the lapply() function. The lm() function then fits the model, and then the summary() extracts the coefficients. Here are the results for all possible linear models with one predictor variable. There are two methods of model selection – the p-value method and the adjusted R-squared method. We are going to use the p-value method since it is simpler to implement.

              Estimate       p-value
(Intercept)  22.2454211     0.39753564
HWtotal       0.6031689     0.03585739

(Intercept)  74.23412354    9.94E-13
q1            0.09184575    2.06E-01

(Intercept)  68.6311475     1.70E-08
q2            0.1357377     1.40E-01

(Intercept)  70.0885609     3.71E-12
q3            0.1258918     4.05E-02

(Intercept)  69.6827466     1.20E-17
q4            0.1520878     1.17E-04

(Intercept)  75.57848273    8.69E-18
q5            0.08175822    4.98E-02

(Intercept)  77.30462776    1.74E-15
q6            0.04201964    3.89E-01

(Intercept)  61.8194946     6.49E-11
q7            0.2545848     1.08E-03

(Intercept)  75.08851214    1.36E-11
q8            0.06235148    3.57E-01

(Intercept)  66.2838735     1.03E-09
q9            0.1817256     3.35E-02

(Intercept)  74.42307692    1.57E-14
q10           0.07730769    1.20E-01

(Intercept)  78.34591195    3.37E-18
q11           0.03647799    4.06E-01

(Intercept)  47.6436925     1.33E-08
quiztotal     0.4686029     3.42E-06

(Intercept)  49.0215606     9.64E-08
midterm       0.3968172     4.26E-05

(Intercept)  45.6114161     0.0004473079
final         0.4469315     0.0045930408

From the results, it is clear that q4, the midterm, and quiztotal are the best predictors of the final grade.

Interpretation of results

The following regression models were chosen for their low p-values. 

Model                                           p-value
Grade ~ midterm                                4.26E-05
Grade ~ q4                                     1.17E-04
Grade ~ HWtotal                                0.03586
Grade ~ HWtotal + quiztotal + midterm + final  2.20E-16

Quiztotal is replaced with q4, even though quiztotal has the lowest p-value and hence the highest predictive power. The logic is that, since quiztotal is an aggregate of 11 quizzes, a regression against it is really a regression on 11 variables with 10 constraints (their relative coefficients are fixed). We see that the model Grade ~ HWtotal + quiztotal + midterm + final has the best predictive power. However, its inputs are only available at the very end of the semester, so it is of little value at the beginning, when predicting student success must be based on a few weeks of data.

The p-values signify that the homework is the worst predictor of the final overall grade while q4 is the best. The instructor has suspected this for some time from experience; the data confirm it statistically. Since students can use external assistance on their homework assignments, the variable HWtotal carries little information about the final grade. Quiz 4 consists of solving linear equations, and students who are unable to learn the simple techniques for solving a linear equation probably do not have the skills or aptitude to learn the more advanced topics in the class. This implies that the student failure rate can be accurately predicted by week 5.

Let us put that theory to the test by stress-testing the model with a test-train split. Since the dataset is so small, bootstrap methods would normally be necessary; for simplicity, however, we are going to begin with a simple test-train split of the data. In a test-train split, the data is randomly sorted into a training set and a test set according to a predetermined ratio called the test-train split ratio. The models are then trained on the training set and evaluated on the test set to see which model provides the best fit. We will be fitting linear models only at the beginning, so the bias-variance tradeoff does not factor in yet. Later, however, we will generalize to higher-dimensional models and explore the bias-variance tradeoff.

Linear regression using test-train split

We randomly select 80% of the data as a training sample and reserve the remaining 20% as test data. We then train the linear regression model Grade ~ q4 on the training sample, whose indices are sketched below.
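The trainset indices used below can be generated along these lines (a sketch; the seed and the exact call are assumptions, as they are not shown in the original analysis):

> set.seed(1)   # for reproducibility; the original seed is not known
> trainset = sample(nrow(df), floor(0.8 * nrow(df)))   # row indices of the 80% training sample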

> gradefit = lm(Grade ~ q4, data = df, subset = trainset)
             Estimate  Std. Error  t value   p-value
(Intercept)   70.5059     2.59796   27.139  3.64E-14
q4            0.15557     0.03332    4.669  0.000303

Test the model on the test data set

> predict(gradefit,df)[-trainset]
Student #               3          6         7        17        20
Predicted Grade  86.06334   80.92938  75.63985  75.63985  86.06334
Actual Grade     81         68        74        74        84
Error            -5.06334  -12.92938  -1.63985  -1.63985  -2.06334

The MSE (mean squared error) on the test data set is about 40. The errors are approximately t-distributed, so a 95% confidence interval for the mean error can be calculated with the t-distribution; it comes out to (-3.4, +1.3).

> mean(((df$Grade - predict(gradefit,df))[-trainset])^2)
40.48838
> errs = (df$Grade - predict(gradefit,df))
> mean(errs)
-1.060716
> sd(errs)
5.19508
> mean(errs) + qt(0.975, df=16)*sd(errs)/sqrt(22)
1.28728
> mean(errs) - qt(0.975, df=16)*sd(errs)/sqrt(22)
-3.408713


This means that the model’s predictions are, on average, within -3.4 to +1.3 points of the actual grade (at 95% confidence). The model can be written mathematically as follows.

Grade = 0.156*Quiz4grade + 70.5

This is remarkable since 70 is the passing grade: Quiz 4 performance directly impacts the pass-fail rate. What we have really accomplished is identifying a predictor of the attrition rate for this class.

Observation

The model trained on the available data works really well within the same class of students. This means that if the first 4 weeks of data are available for a group of students, their final grade can be predicted to within about 3.4 points of the actual final grade, at 95% confidence.

Multiple Linear Regression (MLR)

Read data

> cogn = read.csv("http://bit.ly/dasi_cognitive")
> head(cogn)

  kid_score mom_hs    mom_iq mom_work mom_age
1        65    yes 121.11753      yes      27
2        98    yes  89.36188      yes      25
3        85    yes 115.44316      yes      27
4        83    yes  99.44964      yes      25
5       115    yes  92.74571      yes      27
6        98     no 107.90184       no      18

Analysis of data

Here we are trying to predict kids’ test scores using their mother’s IQ, high school degree, work status, and age. Only a few of these predictor variables have a substantial impact on kid_score. As we shall see shortly, we can improve the fit (raising the adjusted R-squared) by eliminating a few of these variables. To establish a baseline, we begin by fitting the full model.

> cgn.fit = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age ,
 data = cogn)
> summary(cgn.fit)
Call:
lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, 
    data = cogn)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.045 -12.918   1.992  11.563  49.267 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  19.59241    9.21906   2.125   0.0341
mom_hsyes     5.09482     2.3145   2.201   0.0282
mom_iq        0.56147    0.06064   9.259   <2e-16
mom_workyes   2.53718    2.35067   1.079   0.2810
mom_age       0.21802    0.33074   0.659   0.5101

Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared:  0.2171,	Adjusted R-squared:  0.2098 
F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16

We can see that the variables “mom_workyes” and “mom_age” have high p-values.

We start by fitting simple linear regression models with only one predictor variable. First, create a list of the predictor variables to iterate over.

> cols = colnames(cogn)[!(colnames(cogn) %in% c("kid_score"))]
> cols
[1] "mom_hs"   "mom_iq"   "mom_work" "mom_age" 

Fitting kid_score against each predictor variable in the list ("mom_hs", "mom_iq", "mom_work", "mom_age") we get the following adjusted R-squared values.

> for (c in cols){
+ adjr = summary(lm(paste("kid_score", "~", c), data=cogn))$adj.r.square
+ print(c(c,adjr))
+ }
[1] "mom_hs"             "0.0539445105919029"
[1] "mom_iq"            "0.199101580842152"
[1] "mom_work"            "0.00965521339400432"
[1] "mom_age"             "0.00616844313235732"

The adjusted R-squared values demonstrate that the mother’s IQ is the best single predictor of the kid’s test score.

Fitting all possible combinations is a lot of work (see [1]); we would rather use Python to perform those tasks. I will write a separate blog post performing the same analysis in Python.

We can, however, analyze a few of the models manually by performing MLR while removing one predictor variable at a time [2], as sketched below.
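For instance, R’s update() function refits a model with one term dropped; a quick sketch using the full model from above:

> cgn.fit2 = update(cgn.fit, . ~ . - mom_age)   # drop the weakest predictor
> summary(cgn.fit2)$adj.r.squared

Comparing the adjusted R-squared of each reduced model against the full model shows which predictors can be removed without hurting the fit.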


References:

  1. ryouready.wordpress.com/2009/02/06/r-calculating-all-possible-linear-regression-models-for-a-given-set-of-predictors
  2. www.coursera.org/learn/linear-regression-model