Select only first few columns of a matrix in R

The syntax to select a few columns of a matrix (or a data.frame) in R is

matrixname[ , <column list>]

To select the first few columns of a matrix we leave the row number empty and list the columns as a vector. We can manually create a vector of column numbers or if the column numbers are contiguous (as in our case) we can simply use the c() function. So alg[ , (1,2,3,4)] can be written as  alg[,c(1,4)]. 

> alg2 = alg[,c(1:4)]
            Mean grade   Std Total students # of fails
X2016S1          77.00 14.00              5          1
X2016S2          74.00 14.00             11          3
X2016S3          85.00 12.00             20          1
X2017S1          72.00 21.00             22          5
X2017S2          57.45 38.28              7          3
X2018S1          73.91 21.56             17          3
X2018S2          83.20  6.62              9          0
X2018S3          69.98 22.44             14          4
Spring.2019      69.63 28.62             19          2

 

ETL a CSV file to a matrix in R

When a CSV file is imported it is usually imported as a data.frame. The advantage of data.frames is that they can contain disparate data types and the disadvantage is that even numeric data types are imported as characters. So we have a table, not of numeric data types, but strings. Therefore, it needs to be converted to numeric datatype before we can use the data. Below, we will explore how to ETL that table in R so that we can use the imported table as a matrix for numerical operations like calculating a new column from existing columns in the table and finding the correlation between the data in two separate columns (between “Mean grade” and “fail rate”). 

(If you want the short version to just convert a data.frame to a matrix with numeric values see )

To begin we need to identify whether a table is imported as a matrix or as a data.frame. We are using data from 8 semesters of algebra class.

> mydata
               X X2016S1 X2016S2 X2016S3 X2017S1 X2017S2 X2018S1 X2018S2 X2018S3 Spring.2019
1     Mean grade      77 74 85      72 57.45 73.91 83.2   69.98 69.63
2            Std 14    14 12 21   38.28 21.56 6.62   22.44 28.62
3 Total students       5 11 20 22       7 17 9 14   19
4     # of fails       1 3 1       5 3 3 0       4 2
5      fail rate  20.00% 27.27%   5.00% 22.73% 42.86%  17.65% 0.00% 28.57%     10.53%

> typeof(mydata)
[1] "list"

> class(mydata)
[1] "data.frame"


Now we cannot manipulate this dataframe because the data is in string format. We also need to invert the matrix to do the correlation among the “Mean grade” and the “fail rate”.

Let’s start with the transpose

> alg=t(mydata)

> alg                           [,1]         [,2]        [,3]                      [,4]               [,5]       
X           "Mean grade" "Std"   "Total students" "# of fails" "fail rate"
X2016S1     "77"         "14"    "5"              "1"          "20.00%"   
X2016S2     "74"         "14"    "11"             "3"          "27.27%"   
X2016S3     "85"         "12"    "20"             "1"          "5.00%"    
X2017S1     "72"         "21"    "22"             "5"          "22.73%"   
X2017S2     "57.45"      "38.28" "7"              "3"          "42.86%"   
X2018S1     "73.91"      "21.56" "17"             "3"          "17.65%"   
X2018S2     "83.2"       "6.62"  "9"              "0"          "0.00%"    
X2018S3     "69.98"      "22.44" "14"             "4"          "28.57%"   
Spring.2019 "69.63"      "28.62" "19"             "2"          "10.53%"   

> typeof(alg)
[1] "character"

> class(alg)
[1] "matrix"


alg[2,3] = “5” a character and not a numeric value as we would like. We can check this by applying
is.numeric() function on the dataframe (dataframe in R is called data.frame but I will refer them to as dataframes for compatibility with pandas lingo).

> alg[2,3]
X2016S1 
    "5" 

> is.numeric(alg[2,3])
[1] FALSE

> alg[2,]
[1] "77"     "14"     "5"      "1"      "20.00%"

> is.numeric(alg[2,])
[1] FALSE


Now,  instead of finding out the data type of each element individually it is faster to apply the
is.numeric() function to the whole table with the apply() function. The apply() function takes two extra parameters other than the variable it’s acting on. The second parameter c(1,2) specifies that the apply() function should cat on both rows (1) and columns (2). We have to pass it as a vector with c(1,2)  because that is how R accepts data. The third parameter is the function which is to be applied, in this case, is.numeric().

> apply(alg[,c(1:4)],c(1,2),is.numeric)
            Mean grade   Std Total students # of fails
X2016S1          FALSE FALSE          FALSE      FALSE
X2016S2          FALSE FALSE          FALSE      FALSE
X2016S3          FALSE FALSE          FALSE      FALSE
X2017S1          FALSE FALSE          FALSE      FALSE
X2017S2          FALSE FALSE          FALSE      FALSE
X2018S1          FALSE FALSE          FALSE      FALSE
X2018S2          FALSE FALSE          FALSE      FALSE
X2018S3          FALSE FALSE          FALSE      FALSE
Spring.2019      FALSE FALSE          FALSE      FALSE


So we have two more things to do before we can use the matrix
alg

  • Convert the first row of the data frame to a header
  • Cast all numbers to numeric data type

Convert the first row of the data frame to a header

In order to update the column headers we use the colnames() function.

> colnames(mydata)
 [1] "X"           "X2016S1"     "X2016S2"     "X2016S3"     "X2017S1"     "X2017S2"     "X2018S1"    
 [8] "X2018S2"     "X2018S3"     "Spring.2019"

> colnames(alg)
NULL


So the colnames are empty for
alg now. During the transpose with t() the rownames for the dataframe mydata got transferred to the first row of the dataframe alg. Therefore, we have to copy the contents of the first row of alg viz. alg[1,] to its colname and then delete that row.

> alg[1,]
[1] "Mean grade" "Std" "Total students" "# of fails" "fail rate"     

> colnames(alg) = alg[1,]   # assigns the first row as the header

> colnames(alg)
[1] "Mean grade" "Std" "Total students" "# of fails" "fail rate"  
> alg = alg[-1,]  # removes the first row

> alg
            Mean grade Std     Total students # of fails fail rate
X2016S1     "77"       "14"    "5"            "1"        "20.00%" 
X2016S2     "74"       "14"    "11"           "3"        "27.27%" 
X2016S3     "85"       "12"    "20"           "1"        "5.00%"  
X2017S1     "72"       "21"    "22"           "5"        "22.73%" 
X2017S2     "57.45"    "38.28" "7"            "3"        "42.86%" 
X2018S1     "73.91"    "21.56" "17"           "3"        "17.65%" 
X2018S2     "83.2"     "6.62"  "9"            "0"        "0.00%"  
X2018S3     "69.98"    "22.44" "14"           "4"        "28.57%" 
Spring.2019 "69.63"    "28.62" "19"           "2"        "10.53%" 

Cast all numbers to numeric data type

Before converting the matrix alg to numeric, we need to remove the last column whose values are in percentages that cannot be cast into a numeric data type.

Select only first few columns of a matrix

To select the first few columns of a matrix we leave the row number empty and list the columns as a vector. We can manually create a vector of column numbers or if the column numbers are contiguous (as in our case) we can simply use the c() function. So alg[ , (1,2,3,4)] can be written as  alg[,c(1,4)]. And then apply the as.numeric() function to each value using the apply() function.

> alg2 = apply(alg[,c(1:4)],c(1,2),as.numeric)
            Mean grade   Std Total students # of fails
X2016S1          77.00 14.00              5          1
X2016S2          74.00 14.00             11          3
X2016S3          85.00 12.00             20          1
X2017S1          72.00 21.00             22          5
X2017S2          57.45 38.28              7          3
X2018S1          73.91 21.56             17          3
X2018S2          83.20  6.62              9          0
X2018S3          69.98 22.44             14          4
Spring.2019      69.63 28.62             19          2


alg2 is a matrix consisting of numerical values on which we can perform mathematical operations. We can recalculate the column “fail rate” (which we had to drop as it was not numeric) now using columns 3 ( “# of fails”) and  4 (“Total students”).

> failrate = alg2[,4]/alg2[,3]*100
    X2016S1     X2016S2     X2016S3     X2017S1     X2017S2     X2018S1     X2018S2     X2018S3 Spring.2019 
   20.00000    27.27273     5.00000    22.72727    42.85714    17.64706     0.00000    28.57143    10.52632 


> alg = cbind(alg2, failrate)
            Mean grade   Std Total students # of fails failrate
X2016S1          77.00 14.00              5          1 20.00000
X2016S2          74.00 14.00             11          3 27.27273
X2016S3          85.00 12.00             20          1  5.00000
X2017S1          72.00 21.00             22          5 22.72727
X2017S2          57.45 38.28              7          3 42.85714
X2018S1          73.91 21.56             17          3 17.64706
X2018S2          83.20  6.62              9          0  0.00000
X2018S3          69.98 22.44             14          4 28.57143
Spring.2019      69.63 28.62             19          2 10.52632


The
cbind() function attaches the column failrate to the end of the matrix alg2. We assigned the result back to alg replacing the old contents.

Now, we are ready to perform a correlation between the columns “failrate” and “Mean grade”.

> cor(alg[,1],alg[,5])
[1] -0.8562738


A strong negative correlation suggests that a higher “Mean grade” implies a lower “failrate”.

In order to visualize this relationship, we have to plot the scatter diagram and the regression line. It is faster to use the lm() function to do the regression fitting. However, lm() only accepts dataframes. So we convert our matrix back to a dataframe before feeding it to the lm() function. 

> alg.df = data.frame(alg)
            Mean.grade   Std Total.students X..of.fails failrate
X2016S1          77.00 14.00              5           1 20.00000
X2016S2          74.00 14.00             11           3 27.27273
X2016S3          85.00 12.00             20           1  5.00000
X2017S1          72.00 21.00             22           5 22.72727
X2017S2          57.45 38.28              7           3 42.85714
X2018S1          73.91 21.56             17           3 17.64706
X2018S2          83.20  6.62              9           0  0.00000
X2018S3          69.98 22.44             14           4 28.57143
Spring.2019      69.63 28.62             19           2 10.52632


We notice that the spaces have been replaced by periods “.” and special characters (#) by “X”. This is to facilitate writing the regression relation that is passed to the linear regression module.

> alg.fit = lm(failrate ~ Mean.grade,data=alg.df)
> summary(alg.fit)

Call:
lm(formula = failrate ~ Mean.grade, data = alg.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.330  -1.289   1.151   4.199   8.461 

Coefficients:
                      Estimate   Std. Error t value Pr(>|t|)   
(Intercept) 121.1786    23.3301   5.194  0.00126 **
Mean.grade   -1.3833     0.3154  -4.386  0.00321 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.235 on 7 degrees of freedom
Multiple R-squared:  0.7332,    Adjusted R-squared:  0.6951 
F-statistic: 19.24 on 1 and 7 DF,  p-value: 0.003211


As we can see the linear model explains 73% of the variability in the data (R
2 is 0.73) and the low p-value of the coefficients suggest a very good fit (p-value of the slope is 0.00321 which is much smaller than the required 0.05).

> plot(alg.df$Mean.grade,alg.df$failrate, col="blue", pch=16, 
                          xlab="Mean grade", ylab="Fail rate")
> abline(alg.fit, col="red")

null

Importing data from the clipboard in R

Importing data from the clipboard

Copy the data from Google Sheets or Excel and import the data from the clipboard using the function

> x1 = read.table(file="data-share", sep = "\t", 
+ na.strings = "NA", header = T)
> importeddata = read.excel()

 

This function assumes the header row is copied by default. However, this import function does not seem to work all the time. Sometimes it works fine but then it will give errors like

> importeddata = read.excel()
Warning message:
In read.table("clipboard", sep = "\t", header = header) :
incomplete final line found by readTableHeader on 'clipboard'
> importeddata = read.excel()
Error in read.table("clipboard", sep = "\t", header = header) : 
 first five rows are empty: giving up
In addition: Warning message:
In read.table("clipboard", sep = "\t", header = header) :


So I gave up using it and use a workaround. I copy and paste the data in a text editor (Gedit) and save that data as plain text as a file called data-share. Then import that file using the read.table() function.

Importing data into R using read.table()

read.table() is a versatile function which can import any table

> importedata = read.table(file="data-share", sep = "\t", 
+ na.strings = "NA", header = T)
> importedata
  x1 y1 x2 y2
1   2 2  6 5
2   2 5  7 4
3   6 5  8 7
4   7 3  5 6
5   4 7  5 4
6   6 4 NA NA
7   5 3 NA NA
8   4 6 NA NA
9   2 5 NA NA
10  1 3 NA NA

 
We can extract individual columns using the array notation

> importedata[1]
  x1
1   2
2   2
3   6
4   7
5   4
6   6
7   5
8   4
9   2
10  1
> importedata[3]
  x2
1   6
2   7
3   8
4   5
5   5
6  NA
7  NA
8  NA
9  NA
10 NA

 
However, we cannot find the means of individual columns just by applying the mean() function on these lists.

> mean(importedata[1])
[1] NA
Warning message:
In mean.default(importedata[1]) : argument is not numeric or logical: returning NA

 
The reason is that any extracted column is also a list.  importedata and all of its extracted parts contain a header and are therefore considered “list” objects. The easiest solution is to apply the mean() function to each column using the sapply() function. This function acts just like the map() function of Python.

> sapply(importedata, mean, na.rm=T)
x1  y1 x2  y2 
3.9 4.3 6.2 5.2

 
Usage: sapply(data, function, arguments to function)

We needed to use the argument na.rm=TRUE for the function mean() otherwise it will return NA for the last two rows which contain many NA values. The last two columns have less data than the first two. They contain NA values for the rows which are missing data so the last two columns would evaluate to NA if the arg na.rm=T is not used.

> sapply(importedata,mean)
x1  y1 x2  y2 
3.9 4.3  NA NA

 
The alternative to using sapply() would be to extract each column, convert using unlist() function, assign the result to a vector and then find the mean of that vector.

> data1 = unlist(importedata[1], use.names = F)
> mean(data1)
[1] 3.9

 
The first row strips the header, flattens the columns and returns a vector of numbers. The second row finds the mean.

Setting up a virtual environment for Python

Many specialized tools are written for some version of Python like python2.7 and has dependencies on some versions of packages like pandas 0.7.3. Installing these older versions will remove newer versions and create conflicts with existing code. So a better option is to create a virtual environment with the specific package versions only. For example, QSTK does not work with Python 3 or pandas 0.21. It only works with python2.7 and pandas 0.7.3. So we have to create a virtual environment and install these versions.

virtualenv --python=/usr/bin/python2.7 ~/python2.7-virtual-env


This will create the
~/python2.7-virtual-env directory if it doesn’t exist, and also create directories inside it containing a copy of the Python interpreter, the standard library, and various supporting files. Now, we have to go to that directory and run source activate to start a new environment (just like a chroot environment).

source ~/python2.7-virtual-env/bin/activate


This will start a new environment. To test that we are really in the environment 

$ which python

~/python2.7-virtual-env/bin/python

$ python --version
Python 2.7.12


The environment is now using the local version of python which is python 2.7. 
Now we can install the older versions of the required packages.

pip install pandas==0.7.3


==0.7.3 forces install of the version 0.7.3 of pandas. It removes newer versions if already installed by default.

Setting up a virtual environment in Anaconda

Now Anaconda itself is a virtual environment with the latest version of scientific and statistical tools. However, there will be instances where certain older codes will not run with newer versions of the packages. For example, the pandas datareader library which pulls data from Yahoo and Morningstar is broken in version 0.6.0 (See my GitHub page github.com/saugatach/stockanalysis). Let us say we are trying to work around this issue and want to get back pandas-datareader v0.5.0 but also want to keep the latest pandas-datareader v0.6.0. So we create a separate virtual environment within Anaconda called “stocks”. The process is very well detailed in the conda docs conda.io/docs/user-guide/tasks/manage-environments.html.

conda create --name stocks python=3.6 pandas-datareader==0.5.0

This creates a virtenv called stocks which has python 3.6 and the older pandas-datareader. We can activate the environment by

$ source activate stocks

The CLI prompt should have the environment name as the prefix. We can check if the correct version of our package is installed.

(stocks) $ pip3 list
........
numpy (1.15.1)
pandas (0.23.4)
pandas-datareader (0.5.0)
pandocfilters (1.4.2)

parso (0.3.1)