When a CSV file is imported it is usually imported as a data.frame. The advantage of data.frames is that they can contain disparate data types and the disadvantage is that even numeric data types are imported as characters. So we have a table, not of numeric data types, but strings. Therefore, it needs to be converted to numeric datatype before we can use the data. Below, we will explore how to ETL that table in R so that we can use the imported table as a matrix for numerical operations like calculating a new column from existing columns in the table and finding the correlation between the data in two separate columns (between “Mean grade” and “fail rate”).
(If you want the short version to just convert a data.frame to a matrix with numeric values see )
To begin we need to identify whether a table is imported as a matrix or as a data.frame. We are using data from 8 semesters of algebra class.
> mydata X X2016S1 X2016S2 X2016S3 X2017S1 X2017S2 X2018S1 X2018S2 X2018S3 Spring.2019 1 Mean grade 77 74 85 72 57.45 73.91 83.2 69.98 69.63 2 Std 14 14 12 21 38.28 21.56 6.62 22.44 28.62 3 Total students 5 11 20 22 7 17 9 14 19 4 # of fails 1 3 1 5 3 3 0 4 2 5 fail rate 20.00% 27.27% 5.00% 22.73% 42.86% 17.65% 0.00% 28.57% 10.53% > typeof(mydata) [1] "list" > class(mydata) [1] "data.frame"
Now we cannot manipulate this dataframe because the data is in string format. We also need to invert the matrix to do the correlation among the “Mean grade” and the “fail rate”.
Let’s start with the transpose
> alg=t(mydata) > alg [,1] [,2] [,3] [,4] [,5] X "Mean grade" "Std" "Total students" "# of fails" "fail rate" X2016S1 "77" "14" "5" "1" "20.00%" X2016S2 "74" "14" "11" "3" "27.27%" X2016S3 "85" "12" "20" "1" "5.00%" X2017S1 "72" "21" "22" "5" "22.73%" X2017S2 "57.45" "38.28" "7" "3" "42.86%" X2018S1 "73.91" "21.56" "17" "3" "17.65%" X2018S2 "83.2" "6.62" "9" "0" "0.00%" X2018S3 "69.98" "22.44" "14" "4" "28.57%" Spring.2019 "69.63" "28.62" "19" "2" "10.53%" > typeof(alg) [1] "character" > class(alg) [1] "matrix"
alg[2,3] = “5” a character and not a numeric value as we would like. We can check this by applying is.numeric() function on the dataframe (dataframe in R is called data.frame but I will refer them to as dataframes for compatibility with pandas lingo).
> alg[2,3] X2016S1 "5" > is.numeric(alg[2,3]) [1] FALSE > alg[2,] [1] "77" "14" "5" "1" "20.00%" > is.numeric(alg[2,]) [1] FALSE
Now, instead of finding out the data type of each element individually it is faster to apply the is.numeric() function to the whole table with the apply() function. The apply() function takes two extra parameters other than the variable it’s acting on. The second parameter c(1,2) specifies that the apply() function should cat on both rows (1) and columns (2). We have to pass it as a vector with c(1,2) because that is how R accepts data. The third parameter is the function which is to be applied, in this case, is.numeric().
> apply(alg[,c(1:4)],c(1,2),is.numeric) Mean grade Std Total students # of fails X2016S1 FALSE FALSE FALSE FALSE X2016S2 FALSE FALSE FALSE FALSE X2016S3 FALSE FALSE FALSE FALSE X2017S1 FALSE FALSE FALSE FALSE X2017S2 FALSE FALSE FALSE FALSE X2018S1 FALSE FALSE FALSE FALSE X2018S2 FALSE FALSE FALSE FALSE X2018S3 FALSE FALSE FALSE FALSE Spring.2019 FALSE FALSE FALSE FALSE
So we have two more things to do before we can use the matrix alg
- Convert the first row of the data frame to a header
- Cast all numbers to numeric data type
Convert the first row of the data frame to a header
In order to update the column headers we use the colnames() function.
> colnames(mydata) [1] "X" "X2016S1" "X2016S2" "X2016S3" "X2017S1" "X2017S2" "X2018S1" [8] "X2018S2" "X2018S3" "Spring.2019" > colnames(alg) NULL
So the colnames are empty for alg now. During the transpose with t() the rownames for the dataframe mydata got transferred to the first row of the dataframe alg. Therefore, we have to copy the contents of the first row of alg viz. alg[1,] to its colname and then delete that row.
> alg[1,] [1] "Mean grade" "Std" "Total students" "# of fails" "fail rate" > colnames(alg) = alg[1,] # assigns the first row as the header > colnames(alg) [1] "Mean grade" "Std" "Total students" "# of fails" "fail rate" > alg = alg[-1,] # removes the first row > alg Mean grade Std Total students # of fails fail rate X2016S1 "77" "14" "5" "1" "20.00%" X2016S2 "74" "14" "11" "3" "27.27%" X2016S3 "85" "12" "20" "1" "5.00%" X2017S1 "72" "21" "22" "5" "22.73%" X2017S2 "57.45" "38.28" "7" "3" "42.86%" X2018S1 "73.91" "21.56" "17" "3" "17.65%" X2018S2 "83.2" "6.62" "9" "0" "0.00%" X2018S3 "69.98" "22.44" "14" "4" "28.57%" Spring.2019 "69.63" "28.62" "19" "2" "10.53%"
Cast all numbers to numeric data type
Before converting the matrix alg to numeric, we need to remove the last column whose values are in percentages that cannot be cast into a numeric data type.
Select only first few columns of a matrix
To select the first few columns of a matrix we leave the row number empty and list the columns as a vector. We can manually create a vector of column numbers or if the column numbers are contiguous (as in our case) we can simply use the c() function. So alg[ , (1,2,3,4)] can be written as alg[,c(1,4)]. And then apply the as.numeric() function to each value using the apply() function.
> alg2 = apply(alg[,c(1:4)],c(1,2),as.numeric) Mean grade Std Total students # of fails X2016S1 77.00 14.00 5 1 X2016S2 74.00 14.00 11 3 X2016S3 85.00 12.00 20 1 X2017S1 72.00 21.00 22 5 X2017S2 57.45 38.28 7 3 X2018S1 73.91 21.56 17 3 X2018S2 83.20 6.62 9 0 X2018S3 69.98 22.44 14 4 Spring.2019 69.63 28.62 19 2
alg2 is a matrix consisting of numerical values on which we can perform mathematical operations. We can recalculate the column “fail rate” (which we had to drop as it was not numeric) now using columns 3 ( “# of fails”) and 4 (“Total students”).
> failrate = alg2[,4]/alg2[,3]*100 X2016S1 X2016S2 X2016S3 X2017S1 X2017S2 X2018S1 X2018S2 X2018S3 Spring.2019 20.00000 27.27273 5.00000 22.72727 42.85714 17.64706 0.00000 28.57143 10.52632 > alg = cbind(alg2, failrate) Mean grade Std Total students # of fails failrate X2016S1 77.00 14.00 5 1 20.00000 X2016S2 74.00 14.00 11 3 27.27273 X2016S3 85.00 12.00 20 1 5.00000 X2017S1 72.00 21.00 22 5 22.72727 X2017S2 57.45 38.28 7 3 42.85714 X2018S1 73.91 21.56 17 3 17.64706 X2018S2 83.20 6.62 9 0 0.00000 X2018S3 69.98 22.44 14 4 28.57143 Spring.2019 69.63 28.62 19 2 10.52632
The cbind() function attaches the column failrate to the end of the matrix alg2. We assigned the result back to alg replacing the old contents.
Now, we are ready to perform a correlation between the columns “failrate” and “Mean grade”.
> cor(alg[,1],alg[,5]) [1] -0.8562738
A strong negative correlation suggests that a higher “Mean grade” implies a lower “failrate”.
In order to visualize this relationship, we have to plot the scatter diagram and the regression line. It is faster to use the lm() function to do the regression fitting. However, lm() only accepts dataframes. So we convert our matrix back to a dataframe before feeding it to the lm() function.
> alg.df = data.frame(alg) Mean.grade Std Total.students X..of.fails failrate X2016S1 77.00 14.00 5 1 20.00000 X2016S2 74.00 14.00 11 3 27.27273 X2016S3 85.00 12.00 20 1 5.00000 X2017S1 72.00 21.00 22 5 22.72727 X2017S2 57.45 38.28 7 3 42.85714 X2018S1 73.91 21.56 17 3 17.64706 X2018S2 83.20 6.62 9 0 0.00000 X2018S3 69.98 22.44 14 4 28.57143 Spring.2019 69.63 28.62 19 2 10.52632
We notice that the spaces have been replaced by periods “.” and special characters (#) by “X”. This is to facilitate writing the regression relation that is passed to the linear regression module.
> alg.fit = lm(failrate ~ Mean.grade,data=alg.df) > summary(alg.fit) Call: lm(formula = failrate ~ Mean.grade, data = alg.df) Residuals: Min 1Q Median 3Q Max -14.330 -1.289 1.151 4.199 8.461 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 121.1786 23.3301 5.194 0.00126 ** Mean.grade -1.3833 0.3154 -4.386 0.00321 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 7.235 on 7 degrees of freedom Multiple R-squared: 0.7332, Adjusted R-squared: 0.6951 F-statistic: 19.24 on 1 and 7 DF, p-value: 0.003211
As we can see the linear model explains 73% of the variability in the data (R2 is 0.73) and the low p-value of the coefficients suggest a very good fit (p-value of the slope is 0.00321 which is much smaller than the required 0.05).
> plot(alg.df$Mean.grade,alg.df$failrate, col="blue", pch=16, xlab="Mean grade", ylab="Fail rate") > abline(alg.fit, col="red")
