Multiple Linear Regression (MLR)

Read data

> cogn = read.csv("http://bit.ly/dasi_cognitive")
> head(cogn)

  kid_score mom_hs    mom_iq mom_work mom_age
1        65    yes 121.11753      yes      27
2        98    yes  89.36188      yes      25
3        85    yes 115.44316      yes      27
4        83    yes  99.44964      yes      25
5       115    yes  92.74571      yes      27
6        98     no 107.90184       no      18
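
Before fitting anything, it is worth confirming the size and column types of the data frame; a quick check (the comments state what we expect, not captured output):

> dim(cogn)   # 434 rows, 5 columns (429 residual df + 5 coefficients, per the summary below)
> str(cogn)   # mom_hs and mom_work are categorical; mom_iq and mom_age are numeric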

Analysis of data

Here we are trying to predict kids' test scores from their mother's IQ, high school degree, work status, and age. Only some of these predictors have a substantial effect on kid_score. As we shall soon see, we can improve the fit (raising the adjusted R-squared) by eliminating the weaker variables. To establish a baseline, we begin by fitting the full model.

> cgn.fit = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age,
+              data = cogn)
> summary(cgn.fit)
Call:
lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, 
    data = cogn)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.045 -12.918   1.992  11.563  49.267 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 19.59241    9.21906   2.125   0.0341 *  
mom_hsyes    5.09482    2.31450   2.201   0.0282 *  
mom_iq       0.56147    0.06064   9.259   <2e-16 ***
mom_workyes  2.53718    2.35067   1.079    0.281    
mom_age      0.21802    0.33074   0.659   0.5101    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared:  0.2171,	Adjusted R-squared:  0.2098 
F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16

We can see that "mom_workyes" (p = 0.281) and "mom_age" (p = 0.5101) have high p-values, so neither is statistically significant at the 0.05 level.
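
Those p-values can also be extracted programmatically from the coefficient matrix of the summary object, which is handy when screening many fits; a small sketch:

> pvals = coef(summary(cgn.fit))[, "Pr(>|t|)"]
> pvals[pvals > 0.05]   # the terms that are not significant at the 5% level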

We start by fitting simple linear regression models, each with a single predictor. First, build a character vector of the predictor names to iterate over.

> cols = colnames(cogn)[!(colnames(cogn) %in% c("kid_score"))]
> cols
[1] "mom_hs"   "mom_iq"   "mom_work" "mom_age" 

Fitting kid_score against each predictor in turn ("mom_hs", "mom_iq", "mom_work", "mom_age") gives the following adjusted R-squared values.

> for (c in cols){
+   adjr = summary(lm(paste("kid_score", "~", c), data = cogn))$adj.r.squared
+   print(c(c, adjr))
+ }
[1] "mom_hs"             "0.0539445105919029"
[1] "mom_iq"            "0.199101580842152"
[1] "mom_work"            "0.00965521339400432"
[1] "mom_age"             "0.00616844313235732"

The adjusted R-squared values show that the mother's IQ is the best single predictor of the kid's test score.
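
Since mom_iq stands out, a scatter plot with the fitted line is a quick visual check of that single-predictor model; a base-graphics sketch:

> plot(kid_score ~ mom_iq, data = cogn,
+      xlab = "Mother's IQ", ylab = "Kid's test score")
> abline(lm(kid_score ~ mom_iq, data = cogn), col = "red")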

Fitting all possible combinations of predictors by hand is a lot of work (see [1]). I will write a separate blog post performing the same analysis in Python, which is better suited to automating such tasks.
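
That said, with only four predictors the brute-force search is small (2^4 - 1 = 15 candidate models), so here is a sketch of how the enumeration could be done in R, ranking every subset by adjusted R-squared:

> formulas = unlist(lapply(1:length(cols), function(k)
+   apply(combn(cols, k), 2, function(v)
+     paste("kid_score ~", paste(v, collapse = " + ")))))
> adjr2 = sapply(formulas, function(f)
+   summary(lm(as.formula(f), data = cogn))$adj.r.squared)
> head(sort(adjr2, decreasing = TRUE), 3)   # the three best subsets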

We can, however, analyze a few of the models manually: perform MLR repeatedly, removing one predictor variable at a time, a procedure known as backward elimination [2].
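
A minimal sketch of that backward-elimination idea, using update() to drop the weakest term at each step (mom_age first, since it has the highest p-value) and comparing adjusted R-squared values:

> cgn.fit2 = update(cgn.fit, . ~ . - mom_age)
> cgn.fit3 = update(cgn.fit2, . ~ . - mom_work)
> c(summary(cgn.fit)$adj.r.squared,
+   summary(cgn.fit2)$adj.r.squared,
+   summary(cgn.fit3)$adj.r.squared)

Alternatively, drop1(cgn.fit, test = "F") reports the effect of removing each single term from the full model in one call.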

References:

  1. ryouready.wordpress.com/2009/02/06/r-calculating-all-possible-linear-regression-models-for-a-given-set-of-predictors
  2. www.coursera.org/learn/linear-regression-model
