Read data
> cogn = read.csv("http://bit.ly/dasi_cognitive")
> head(cogn)
| kid_score | mom_hs | mom_iq | mom_work | mom_age |
|---|---|---|---|---|
| 65 | yes | 121.11753 | yes | 27 |
| 98 | yes | 89.36188 | yes | 25 |
| 85 | yes | 115.44316 | yes | 27 |
| 83 | yes | 99.44964 | yes | 25 |
| 115 | yes | 92.74571 | yes | 27 |
| 98 | no | 107.90184 | no | 18 |
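Before fitting anything, it can help to confirm what was loaded (a quick optional check; `str` and `dim` are base R):

str(cogn)   # column types for each variable
dim(cogn)   # rows and columns; the 429 residual df reported below imply 434 rows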
Analysis of data
Here we try to predict kids' test scores using their mother's IQ, high school degree, work status, and age. Only a few of these predictor variables have a substantial impact on kid_score. As we shall see, we can improve the fit (as measured by a higher adjusted R-squared) by eliminating some of these variables. To establish a baseline, we begin by fitting the full model.
> cgn.fit = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = cogn)
> summary(cgn.fit)

Call:
lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = cogn)

Residuals:
    Min      1Q  Median      3Q     Max
-54.045 -12.918   1.992  11.563  49.267

Coefficients:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 19.59241 | 9.21906 | 2.125 | 0.0341 |
| mom_hsyes | 5.09482 | 2.3145 | 2.201 | 0.0282 |
| mom_iq | 0.56147 | 0.06064 | 9.259 | <2e-16 |
| mom_workyes | 2.53718 | 2.35067 | 1.079 | 0.281 |
| mom_age | 0.21802 | 0.33074 | 0.659 | 0.5101 |
Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared:  0.2171,    Adjusted R-squared:  0.2098
F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16
We can see that the variables "mom_workyes" and "mom_age" have high p-values (0.281 and 0.510, respectively), so neither adds significant explanatory power once the other predictors are in the model.
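As a quick check (a minimal sketch; `cgn.fit2` is just an illustrative name), we can drop those two predictors and compare the adjusted R-squared of the reduced model against the 0.2098 of the full model:

# Refit without the two high p-value predictors and inspect the fit
cgn.fit2 = lm(kid_score ~ mom_hs + mom_iq, data = cogn)
summary(cgn.fit2)$adj.r.squared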
Next, we fit simple linear regression models with a single predictor each. First, create a character vector of the predictor names to iterate over.
> cols = colnames(cogn)[!(colnames(cogn) %in% c("kid_score"))]
> cols
[1] "mom_hs"   "mom_iq"   "mom_work" "mom_age"
Fitting kid_score against each predictor variable in the list ("mom_hs", "mom_iq", "mom_work", "mom_age") gives the following adjusted R-squared values.
> for (c in cols){
+   adjr = summary(lm(paste("kid_score", "~", c), data=cogn))$adj.r.square
+   print(c(c, adjr))
+ }
[1] "mom_hs"             "0.0539445105919029"
[1] "mom_iq"             "0.199101580842152"
[1] "mom_work"           "0.00965521339400432"
[1] "mom_age"            "0.00616844313235732"
The adjusted R-squared values show that the mother's IQ is the best single predictor of the kid's test score.
Fitting all possible combinations of predictors by hand is a lot of work (see [1]). We would rather use Python to automate that task; I will write a separate blog post performing the same analysis in Python.
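For reference, a minimal R sketch of the brute-force approach described in [1] could look like the following (names such as `results` are illustrative; `combn` and `lm` are base R):

# Enumerate every non-empty subset of predictors and record
# the adjusted R-squared of the corresponding model.
results = c()
for (k in 1:length(cols)) {
  for (s in combn(cols, k, simplify = FALSE)) {
    f = paste("kid_score ~", paste(s, collapse = " + "))
    results[f] = summary(lm(f, data = cogn))$adj.r.squared
  }
}
sort(results, decreasing = TRUE)  # models ranked by adjusted R-squared

With 4 predictors this is only 15 models, but the count doubles with every added variable, which is why [1] automates it.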
We can, however, analyze a few of the models manually: starting from the full model, we refit after removing one predictor variable at a time (backward elimination) [2], as sketched below.
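A minimal sketch of one elimination step (base R's `update()` rebuilds a model with a modified formula; the loop itself is illustrative):

# Drop each predictor in turn from the full model and report
# the adjusted R-squared of each reduced model.
for (c in cols) {
  reduced = update(cgn.fit, paste(". ~ . -", c))
  print(c(c, summary(reduced)$adj.r.squared))
}

Base R's `drop1(cgn.fit, test = "F")` performs a similar drop-one comparison using F-tests.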
References:
1. ryouready.wordpress.com/2009/02/06/r-calculating-all-possible-linear-regression-models-for-a-given-set-of-predictors
2. www.coursera.org/learn/linear-regression-model