Read data
> cogn = read.csv("http://bit.ly/dasi_cognitive")
> head(cogn)
| kid_score | mom_hs | mom_iq | mom_work | mom_age |
|---|---|---|---|---|
| 65 | yes | 121.11753 | yes | 27 |
| 98 | yes | 89.36188 | yes | 25 |
| 85 | yes | 115.44316 | yes | 27 |
| 83 | yes | 99.44964 | yes | 25 |
| 115 | yes | 92.74571 | yes | 27 |
| 98 | no | 107.90184 | no | 18 |
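Before fitting anything, it can help to confirm what was loaded (a quick optional check; `str` and `dim` are base R):

str(cogn)   # column types for each variable
dim(cogn)   # rows and columns; the 429 residual df reported below imply 434 rows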
Analysis of data
Here we try to predict kids' test scores using their mother's IQ, high school degree, work status, and age. Only a few of these predictor variables have a substantial impact on kid_score. As we shall see, we can improve the fit (as measured by a higher adjusted R-squared) by eliminating some of these variables. To establish a baseline, we begin by fitting the full model.
> cgn.fit = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = cogn)
> summary(cgn.fit)

Call:
lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = cogn)

Residuals:
    Min      1Q  Median      3Q     Max
-54.045 -12.918   1.992  11.563  49.267

Coefficients:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 19.59241 | 9.21906 | 2.125 | 0.0341 |
| mom_hsyes | 5.09482 | 2.3145 | 2.201 | 0.0282 |
| mom_iq | 0.56147 | 0.06064 | 9.259 | <2e-16 |
| mom_workyes | 2.53718 | 2.35067 | 1.079 | 0.281 |
| mom_age | 0.21802 | 0.33074 | 0.659 | 0.5101 |
Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared:  0.2171,    Adjusted R-squared:  0.2098
F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16
We can see that the variables "mom_workyes" and "mom_age" have high p-values (0.281 and 0.510, respectively), so neither adds significant explanatory power once the other predictors are in the model.
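As a quick check (a minimal sketch; `cgn.fit2` is just an illustrative name), we can drop those two predictors and compare the adjusted R-squared of the reduced model against the 0.2098 of the full model:

# Refit without the two high p-value predictors and inspect the fit
cgn.fit2 = lm(kid_score ~ mom_hs + mom_iq, data = cogn)
summary(cgn.fit2)$adj.r.squared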
Next, we fit simple linear regression models with a single predictor each. First, create a character vector of the predictor names to iterate over.
> cols = colnames(cogn)[!(colnames(cogn) %in% c("kid_score"))]
> cols
[1] "mom_hs"   "mom_iq"   "mom_work" "mom_age"
Fitting kid_score against each predictor variable in the list ("mom_hs", "mom_iq", "mom_work", "mom_age") gives the following adjusted R-squared values.
> for (c in cols){
+   adjr = summary(lm(paste("kid_score", "~", c), data=cogn))$adj.r.square
+   print(c(c, adjr))
+ }
[1] "mom_hs"             "0.0539445105919029"
[1] "mom_iq"             "0.199101580842152"
[1] "mom_work"           "0.00965521339400432"
[1] "mom_age"            "0.00616844313235732"
The adjusted R-squared values show that the mother's IQ is the best single predictor of the kid's test score.
Fitting all possible combinations of predictors by hand is a lot of work (see [1]). We would rather use Python to automate that task; I will write a separate blog post performing the same analysis in Python.
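For reference, a minimal R sketch of the brute-force approach described in [1] could look like the following (names such as `results` are illustrative; `combn` and `lm` are base R):

# Enumerate every non-empty subset of predictors and record
# the adjusted R-squared of the corresponding model.
results = c()
for (k in 1:length(cols)) {
  for (s in combn(cols, k, simplify = FALSE)) {
    f = paste("kid_score ~", paste(s, collapse = " + "))
    results[f] = summary(lm(f, data = cogn))$adj.r.squared
  }
}
sort(results, decreasing = TRUE)  # models ranked by adjusted R-squared

With 4 predictors this is only 15 models, but the count doubles with every added variable, which is why [1] automates it.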
We can, however, analyze a few of the models manually: starting from the full model, we refit after removing one predictor variable at a time (backward elimination) [2], as sketched below.
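A minimal sketch of one elimination step (base R's `update()` rebuilds a model with a modified formula; the loop itself is illustrative):

# Drop each predictor in turn from the full model and report
# the adjusted R-squared of each reduced model.
for (c in cols) {
  reduced = update(cgn.fit, paste(". ~ . -", c))
  print(c(c, summary(reduced)$adj.r.squared))
}

Base R's `drop1(cgn.fit, test = "F")` performs a similar drop-one comparison using F-tests.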
References:
1. ryouready.wordpress.com/2009/02/06/r-calculating-all-possible-linear-regression-models-for-a-given-set-of-predictors
2. www.coursera.org/learn/linear-regression-model