Add columns of a CSV file with a single line of bash commands

Analysing and extracting results from data stored in CSV files is made simple by spreadsheet programs, which can import them easily and support a wide variety of formats. For example, here is the exported data from my HSA contribution file.

12/01/2019,11/30/2019,Interest,None,$0.09
11/12/2019,11/12/2019,CIGNA HEALTHCARE 68,Debit Card,-$25.00
11/04/2019,11/04/2019,Fee Distribution,EFT,-$1.75
11/01/2019,10/31/2019,Interest,None,$0.09
10/02/2019,10/02/2019,Fee Distribution,EFT,-$1.75
10/01/2019,09/30/2019,Interest,None,$0.09
09/04/2019,09/04/2019,Fee Distribution,EFT,-$1.75
09/01/2019,08/31/2019,Interest,None,$0.09
08/03/2019,08/02/2019,Fee Distribution,EFT,-$1.75
08/01/2019,07/31/2019,Interest,None,$0.09
07/05/2019,07/05/2019,CIGNA HEALTHCARE OF AR,Debit Card,-$25.00
07/02/2019,07/02/2019,Fee Distribution,EFT,-$1.75
07/01/2019,06/30/2019,Interest,None,$0.09
06/04/2019,06/04/2019,Fee Distribution,EFT,-$1.75
06/03/2019,06/03/2019,CIGNA HEALTHCARE OF AR,Debit Card,-$25.00
06/01/2019,05/31/2019,Interest,None,$0.09
05/31/2019,05/31/2019,CVS/PHARMACY #07078,Debit Card,-$15.00
05/13/2019,05/13/2019,CIGNA HLTHCRE AZ#HCFS,Debit Card,-$65.80
05/02/2019,05/02/2019,Fee Distribution,EFT,-$1.75
05/01/2019,04/30/2019,Interest,None,$0.09
04/02/2019,04/02/2019,Fee Distribution,EFT,-$1.75
04/01/2019,03/31/2019,Interest,None,$0.10
03/21/2019,03/21/2019,Adjustment Contribution,EFT,$1.75
03/04/2019,03/04/2019,Fee Distribution,EFT,-$1.75
03/01/2019,02/28/2019,Interest,None,$0.18
02/25/2019,02/25/2019,CVS/PHARMACY #07078,Debit Card,-$15.00
HSA contribution CSV file called hsa.csv

To get only the 2019 activity in a spreadsheet program, we would need to set up filters and filter out every year except 2019. That is a lot of mouse events. On the command line we can just use a single command – grep.
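
For example, grep prints only the rows containing 2019:

$ cat hsa.csv | grep 2019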

Filter for specific column values and then sum a specific column

Now, let's say we want to find the total expenditure in PHARMACY. In a spreadsheet we would have to filter again by PHARMACY and then sum the filtered column. While this is not too difficult, there can be issues with autosum summing hidden rows too, leading to erroneous results. On the command line we simply grep again.

$ cat hsa.csv | grep 2019 | grep PHARMACY 

Summing the last column first involves extracting it. A single column or a list of columns can be extracted using the cut command. The -d switch defines the delimiter (the separator of fields), which in the case of CSV files is the comma ",". The cut command thus cuts up the file vertically along the delimiter. -f lists the columns to be extracted – either a single number for a single column or a range such as 2-5 to extract columns 2 to 5 (for more info see the man page of cut).

$ cat hsa.csv | grep 2019 | grep PHARMACY | cut -d"," -f5 

And then we extract just the numbers in that column. While the numbers can be extracted in a variety of ways, including sed and awk (which many people prefer), I will use the tr (translate) command. tr is primarily used to replace characters, but it can also delete them.

$  cat hsa.csv | grep 2019 | grep PHARMACY | \
> cut -d"," -f5 | tr -cd '[:digit:].\n'
15.00
15.00
15.00

tr -c selects the complement of [:digit:] (0-9), the period (.), and the newline (\n), i.e. everything except digits, decimal points, and newlines. -d then deletes those characters.

Now we just have to sum those numbers.

$  cat hsa.csv | grep 2019 | grep PHARMACY | \
> cut -d"," -f5 | tr -cd '[:digit:].\n' | paste -sd+ | bc
45.00

paste -s joins all the input lines into one line using the separator + (giving 15.00+15.00+15.00), which is then piped to the command-line calculator bc to perform the addition.

This is good for learning, but if we do not feel like going through these steps over and over again, I have written a script that auto-detects the column to be summed and sums it after filtering for a given pattern. The script below can also be downloaded from my GitHub page.

#!/bin/bash
# sumcolumn.sh v1.5 by Saugata copyright 07/07/2020
# autosum price column of a csv after filtering for a specific column value

usage="
Usage: ./sumcolumn.sh [filename] [pattern] [-h help]
  
Example: ./sumcolumn.sh hsa.csv PHARMACY
"

# ---- GETOPTS ----
# no args. print usage and exit
if [[ $# -eq 0 ]]; then
 echo "$usage"
 exit
fi

while getopts h option
do
case "${option}"  in
 h) echo "$usage" 
 exit ;;
esac
done

# check if first arg is a file which exists
if [[ ! -f $1 ]]; then
  echo "File doesn't exist"
  exit
fi
echo "Filename: " $1

# check if file is a text file
txtfile=`file "$1" | grep "text"`

if [[ $txtfile == "" ]]; then
  echo "File not a text file. Choose a text file."
  exit
fi

# check if pattern exists in file
ptn=`grep "$2" "$1"`
if [[ $ptn == "" ]]; then
  echo "Pattern does not exist in file. No sum."
  exit
fi

# identify column containing price (elements have $ symbol)
sumcol="$(head -n 2 "$1" | tail -n 1 | tr "," "\n" | nl \
 | grep "\\$" | cut -f1 | tr -d ' ')"

if [[ $sumcol == "" ]]; then
  echo "No columns to sum"
  exit
fi

echo "Summing column: " `head -n 1 "$1" | cut -d"," -f$sumcol`

echo "Total: " total="$(grep "$2" "$1" | cut -d"," -f$sumcol \
 | tr -cd '[:digit:].\n' | paste -sd+ | bc)"
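
The script takes a filename and a filter pattern, as in the usage example:

$ ./sumcolumn.sh hsa.csv PHARMACY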

Simple script to join files using ffmpeg

A simple script to join files using ffmpeg. This script works in Linux only.

Save the following code as a script file, for example joinvideos.sh, and set the executable permission with chmod 755 joinvideos.sh. Then copy all the videos that need to be joined into a single folder containing the script. The script will join any video files it finds in the folder, in the alphabetical order of their names, so it is important to rename the files as f1.mp4, f2.mp4, f3.mp4, etc. if the original filenames have no natural order.
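
For example:

$ chmod 755 joinvideos.sh
$ ./joinvideos.sh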

#!/bin/bash
# joinvideos.sh v1.5 by Saugata
# join multiple video files with ffmpeg 

# remove temp files if they exist from interrupted past run
if ls temp-*.mp4 1> /dev/null 2>&1; then
  echo "Removing older temp*.mp4 files"
  rm temp-*.mp4
fi

# make a list of all mp4 files in the directory
# except the output.mp4 if it exists
listofvidfiles=`ls -1 | grep mp4 | grep -v "output"`

for f in $listofvidfiles;
do
  # check the location of the Video stream:
  # if Video is at Stream #0:1 then swap it to Stream #0:0
  # and save the result as a temp file;
  # if Video is already at Stream #0:0 just copy it to a temp file
  vidstreamno=`ffprobe $f 2>&1 | grep Video | cut -d":" -f2 | cut -d"(" -f1`
  if [ "$vidstreamno" -eq "1" ]; then
    ffmpeg -i $f -map 0:1 -map 0:0 -c copy temp-$f
  else
    cp $f temp-$f
  fi
  echo $f: "Video stream at" $vidstreamno
done

# create a text file from the list of temp-*.mp4 files for ffmpeg input
ls -1 temp-*.mp4 | sed "s/^/file '/" | sed "s/$/'/" > vidfilelist.txt
echo "-----------------------------"
echo "Joining files ..............."
cat vidfilelist.txt 
echo "-----------------------------"

# use ffmpeg to join mp4 files
ffmpeg -f concat -safe 0 -i vidfilelist.txt -c copy output.mp4

# clean temp files before exit
if ls temp-*.mp4 1> /dev/null 2>&1; then
  echo "Removing temp*.mp4 files"
  rm temp-*.mp4
fi


Getting error: Non-monotonous DTS while concatenating video files in ffmpeg

ffmpeg is a versatile tool which can concatenate multiple videos into a single video with minimal processing load. ffmpeg takes a text file listing the video files in the order they are to be joined. A sample command looks like this.

ffmpeg -f concat -safe 0 -i mylist.txt -c copy output.mp4

This command can be used universally as long as the mylist.txt file contains the correct list of video files to be joined. The text file might look like this.

# this is a comment
file 'f1.mp4'
file 'f2.mp4'

I was joining pieces of lecture videos recently using ffmpeg, and it was working perfectly until it broke. I started getting the error "Non-monotonous DTS in output stream".

[mp4 @ 0x559118faf340] Non-monotonous DTS in output stream 0:1; previous: 10392576, current: 9748800; changing to 10392577. This may result in incorrect timestamps in the output file.

The issue seemed to be with the video file itself. The online forums discussed something about timestamps, which didn't make sense to me. I checked the encoding and bitrates, and all seemed to be in order, except that the video and the audio streams were switched in the second video: while f1.mp4 has its video stream at Stream #0:0, f2.mp4 has its video stream at Stream #0:1.

[Screenshot: ffprobe output showing the swapped stream order in f2.mp4]

To correct this, we use the -map option of ffmpeg to swap the audio and the video streams.

$ ffmpeg -i f2.mp4 -map 0:1 -map 0:0 -c copy  new.mp4


And the ffmpeg -f concat started working again!!

References:
[1] https://trac.ffmpeg.org/wiki/Map

A simple script to search inside text files

Many times we know that we wrote some code snippet we want to reuse, but cannot remember which file it is in. We need to search inside the files (Python files, say) to get to that code snippet. The following script is designed to make that search easier: it goes into every subdirectory and searches every text file for the string. To get the usage, just run the script without any arguments.

#!/bin/bash
# search inside text files : v4.1 by Saugata 

usage="
Usage: ./searchinsidefiles.sh [-s word to search] 
                              [-t file types to search]
                              [-v verbose]
                              [-h help]
  
Example: ./searchinsidefiles.sh -s DataFrame -t py
Example: ./searchinsidefiles.sh DataFrame py
"
# ---- SET INITIAL VALUES ----
word=""
file_ext="*"
verbose=0
# ---- GETOPTS ----
# no args. print usage and exit
if [[ $# -eq 0 ]]; then
 echo "$usage"
 exit
fi

# if $1 doesn't start with a switch - then the user has used
# the other way of passing args
if [[ "$1" =~ ^[a-zA-Z0-9]+$ ]]; then  
 # ---- SET INITIAL VALUES ----
 word=$1
 file_ext=$2

 # Second argument might be empty
 # which means $file_ext
 # will be empty at this point too.
 # Set $file_ext to its default value in case it is empty
 [ "$2" == "" ] && file_ext="*"

else
 # the user has used a switch to pass args. Use getopts
 while getopts s:t:vh option
 do
  case "${option}"  in
   s) word=${OPTARG};;
   t) file_ext=${OPTARG};;
   v) verbose=1;;
   h) echo "$usage" 
   exit ;;
  esac
 done
fi

# -----------------------

echo
echo "Pattern to search for: " $word
echo "Files being searched: " $file_ext
echo

IFS=$'\n'
filenames=`find . -type f -name "*.$file_ext"` 
for i in $filenames
do
istextfile=`file $i | grep "text"`

if [ "$istextfile" ]; then
 text=`cat $i | grep "$word"`
 if [ "$text" ] ; then 
  echo "-------------------------------------" 
  echo "FILE : " $i 
  if [ "$verbose" -eq "1" ] ; then
   echo
   grep "$word" $i -A2 -B2 --color=auto
   echo
  fi
 fi
fi
done
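
For example, searching all Python files for DataFrame (the output below is illustrative; the file path shown is hypothetical):

$ ./searchinsidefiles.sh -s DataFrame -t py

Pattern to search for:  DataFrame
Files being searched:  py

-------------------------------------
FILE :  ./notebooks/cleanup.py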

A Python GUI version is going to follow.

Application of Linear Regression with R – Predicting Final Grades based on GPA


In the problem below, our objective is to identify whether there exists a cutoff GPA such that students whose GPA falls below it tend to fail the class. This would allow us to identify at-risk students early in the course. Since the GPA is collected at the beginning of the course and the final grades are obtained at the end, the GPA should be the predictor variable even though we want to identify a cutoff GPA. Therefore, the model we will fit is

final grade ~ GPA

Let us assume that the final grade is linearly related to the students’ GPA. We can check for linearity in a later section. 

> df = read.csv("gpa.csv")

The imported data looks like this. Names are withheld.

Student CurrentGPA TotalAbsent NumericGrade
***ia 3 15.04 89
***lheid 3.67 21.52 83
***ue 1.2 31.63 17
***tha 2.5 24.56 71
***raina 2 54.04 15
***ia 2.8 13.93 72
***issy 3.33 24.33 79
***ob 2.78 8.15 80
***uel 2.67 10.56 73
***hael 3.1 8.26 79

If we plot the quartiles of the final grades

> boxplot(df$NumericGrade)

it would look like this

[Figure: boxplot of NumericGrade]

and if we plot GPA vs FinalGrade it will look like this

[Figure: scatter plot of GPA vs FinalGrade]

We fit the linear model

> gpafit = lm(NumericGrade ~ CurrentGPA, data=df)

Ideally, we should do a test-train split, fit on the training data, and repeat until we reduce the bias. However, we are going for efficiency here and not accuracy. We plot the fitted line (red line) on the scatter plot of GPA vs FinalGrade.

> abline(gpafit, col="red")

[Figure: scatter plot of GPA vs FinalGrade with the fitted line in red]

We want to set the cutoff for the passing grade at 70 and see what the cutoff GPA is for getting 70 and above. Now, we have modelled FinalGrade as a function of GPA and not the other way around. To predict the GPA (x-value) for a given FinalGrade (y-value) we can use the approx() function.

> approx(gpafit$fitted.values, df$CurrentGPA, 70)
$x
[1] 70

$y
[1] 2.680097

From the output we see that when x=70, y=2.68, where x is now the FinalGrade and y is the GPA. approx() takes a list of x-values and the corresponding y-values and performs linear interpolation at x_pred to get y_pred = approx(x, y, x_pred). We draw two lines, one at FinalGrade = 70 and one at GPA = 2.68, to show our cutoffs.

> plot(df$CurrentGPA, df$NumericGrade, col="blue", 
      xlab = "GPA", ylab="FinalGrade")
> abline(gpafit, col="red")
> abline(v=2.68)
> abline(h=70)

[Figure: scatter plot with the fitted line and cutoff lines at GPA = 2.68 and FinalGrade = 70]

It looks like any student with a GPA > 2.68 will tend to pass the course. There are some students who passed the course in spite of having a GPA < 2.68, but all students with a GPA > 2.68 passed except one. We can even identify those students by doing a Python-like slicing of the dataframe.

> df[df$NumericGrade<70 & df$CurrentGPA>2.68,]
   Student CurrentGPA TotalAbsent NumericGrade
59 ***iola          3       28.04           46

We will run the regression analysis with all the predictor variables including student absenteeism in a later post.

Plot a grid of plots in python by iterating over the subplots

In this article, we will make a grid of plots in python by iterating over the subplot axes and columns of a pandas dataframe.

Python has a versatile plotting framework in Matplotlib, but the documentation seems extremely poor (or I was not able to find the right docs). It took me a fair amount of time to figure out how to send plots of dataframe columns to individual subplots while rotating the xlabels of each subplot.

Usage

Plotting subplots in Matplotlib begins by using the plt.subplots() statement.

import pandas as pd
import matplotlib.pyplot as plt


fig, axs = plt.subplots(nrows=2, ncols=2)

We can omit the nrows and ncols args but I kept them for effect. This statement generates a grid of 2×2 subplots and returns the overall figure (the object which contains all the plots inside it) and the individual subplot axes as a 2-D array. The subplots can be accessed using axs[0,0], axs[0,1], axs[1,0], and axs[1,1]. Or they can be unpacked during the assignment as follows.

import pandas as pd
import matplotlib.pyplot as plt


fig, ((ax1, ax2),(ax3, ax4)) = plt.subplots(nrows=2, ncols=2)

When we have 1 row and 4 columns instead of 2 rows and 2 columns it has to be unpacked as follows.

import pandas as pd
import matplotlib.pyplot as plt


fig, ((ax1, ax2, ax3, ax4)) = plt.subplots(nrows=1, ncols=4)

Flattening the grid of subplots

We, however, do not want to unpack each subplot individually. Instead, we would like to flatten the array of subplots and iterate over it rather than assigning each subplot to a variable. The array is flattened with the flatten() method.

axs.flatten()

We identify 4 columns of a dataframe we want to plot and save the column names in a list that we can iterate over. We flatten the subplots and generate an iterator or we can convert the iterator to a list and then pack it (zip) with the column names.

import pandas as pd
import matplotlib.pyplot as plt


profiles_file = 'data.csv'
df = pd.read_csv(profiles_file)

cols_to_plot = ['age', 'drinking', 'exercise', 'smoking']

fig, axs = plt.subplots(nrows=2, ncols=2)
fig.set_size_inches(20, 10)
fig.subplots_adjust(wspace=0.2)
fig.subplots_adjust(hspace=0.5)

for col, ax in zip(cols_to_plot, axs.flatten()):
    dftemp = df[col].value_counts()
    ax.bar(dftemp.index, list(dftemp))
    ax.set_title(col)
    ax.tick_params(axis='x', labelrotation=30)

plt.show()

As we iterate over each subplot axes and the column name zipped with it, we draw on each subplot with the ax.bar() command, supplying the x and y values manually. I tried plotting with the pandas plot df.plot.bar() and assigning the returned object to the ax; it doesn't work. The x values for ax.bar() are the index of the value counts (dftemp.index) and the y values are the counts themselves (which need to be converted to a list, as ax.bar() did not accept the pd.Series directly).

Rotate x-axis of subplots

The x-axis tick labels of each subplot are rotated using

ax.tick_params(axis='x', labelrotation=30)


Use pandas to convert a date to datetime format

Importing dates from a CSV file is always a hassle. With the myriad of DateTime formats possible, we would need to write extensive amounts of code to accommodate all possible DateTime formats, or put restrictions on the contents of the CSV file. We don't want to do either. Instead of hard-coding commands like

datetime.strptime(string, "%m/%d/%Y")

into our code, we can use pandas to convert the dates for us. Pandas can convert an entire column of dates in string format to DateTime format. We just need to be careful when importing plain dates rather than full DateTime objects: pandas converts to Timestamp objects, and if we are importing just dates then the time component is undesirable. We strip off the time part using .date() at the end. So instead of

pd.to_datetime(date)

we will need to use

pd.to_datetime(date).date()
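
For example, with an arbitrary date string:

pd.to_datetime("10/28/2019").date()
datetime.date(2019, 10, 28)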

An example script illustrates this procedure.

import datetime as dt
import pandas as pd

def dateformat(date):
    # use pandas to convert a date string to datetime format
    # extract just the date since pandas returns the date as a Timestamp object
    # repack the date as datetime using datetime.datetime.combine() with time = 00:00

    date = dt.datetime.combine(pd.to_datetime(date).date(),
                               dt.datetime.min.time())
    return date
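
A quick check with an arbitrary date string:

dateformat("10/28/2019")
datetime.datetime(2019, 10, 28, 0, 0)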


Multiple linear regression with R

We demonstrate multiple linear regression (MLR) with R using a case study. Here we perform an “Analysis of grades for an algebra class”. In this case study, we use multiple analytical tools including multiple linear regression, model selection with p-values, and model verification with a test-train split of data.

Analysis of grades for an algebra class

Objective

Identify predictors of attrition and evaluate methods to measure the difficulty levels of exams

Measuring exam difficulty by comparing the medians

If student competence remains constant, then exams of similar difficulty should have the same median score. The spread of the data, or the standard deviation, reflects the degree of preparedness or level of understanding among the students, so the standard deviation should decrease over time as the course progresses. We plot the data for the MH140 Fall 2019 class and observe exactly this trend. The data demonstrate that the tests are of the same difficulty level.

Ideally, the median score remains the same across exams while the standard deviation decreases over time.

R code

Preparing the data requires some effort, as the LMS (Learning Management System) exports everything. We have to select only the relevant columns of data, which are the weighted totals of the grading elements. After extracting the grading element totals, we compile them into an Excel spreadsheet and export it to CSV format, which is then imported into R using the following code.

> df = read.csv("fall2019.csv")


Then a boxplot of quiztotal, midterm, final, and Grade is plotted to observe the medians and the standard deviations of the scores.

> boxplot(df$quiztotal, df$midterm, df$final, df$Grade, 
+ names = c("QuizTotal", "Midterm", "Final", "OverallGrade" ) , 
+ col ="deepskyblue2", ylab="Grades (%)", 
+ main = "Comparing the medians of grading elements")

names – labels each boxplot; the names have to be provided as a vector c("label1", "label2", ...)
ylab  – labels the y-axis
main  – boxplot title
col   – box color

Identifying predictors of attrition

We analyze the gradebook data to identify which grading elements are the best predictors of failure rates in a class. For this, we clean the data and prepare it for the statistical software. Then we perform multiple linear regression on the data using the variable Grade (the final overall grade the student receives in the class) as the response variable. The regression is performed against the following variables.

HWtotal  q1 q2 q3  q4 q5 q6 q7  q8 q9 q10 q11 quiztotal midterm final 

Results of fitting linear regression models

We can fit against all predictor variables by manually typing in the predictor variables, like this.

> lm(Grade ~ HWtotal + quiztotal + midterm + final, data=df)

However, instead of fitting the models manually we will first iterate over all possible models.

> lapply(colnames(df)[2:17], 
+ function(x) summary(lm(paste("Grade", " ~ ", x),
+ data = df))$coefficients)

The lapply() function extracts the column names, which are the predictor variables, and iteratively sends them to the function fittingly called function(x). Inside function(x), the paste() function generates the linear model formula (e.g. Grade ~ midterm) from the predictor variable passed in by lapply(). The lm() function then fits the model, and summary() extracts the coefficients. Here are the results for all possible linear models with one predictor variable. There are two methods of model selection – the p-value method and the adjusted R-squared method. We are going to use the p-value method since it is simpler to implement.

             Estimate      p-value
(Intercept)  22.2454211    0.39753564
HWtotal       0.6031689    0.03585739
(Intercept)  74.23412354   9.94E-13
q1            0.09184575   2.06E-01
(Intercept)  68.6311475    1.70E-08
q2            0.1357377    1.40E-01
(Intercept)  70.0885609    3.71E-12
q3            0.1258918    4.05E-02
(Intercept)  69.6827466    1.20E-17
q4            0.1520878    1.17E-04
(Intercept)  75.57848273   8.69E-18
q5            0.08175822   4.98E-02
(Intercept)  77.30462776   1.74E-15
q6            0.04201964   3.89E-01
(Intercept)  61.8194946    6.49E-11
q7            0.2545848    1.08E-03
(Intercept)  75.08851214   1.36E-11
q8            0.06235148   3.57E-01
(Intercept)  66.2838735    1.03E-09
q9            0.1817256    3.35E-02
(Intercept)  74.42307692   1.57E-14
q10           0.07730769   1.20E-01
(Intercept)  78.34591195   3.37E-18
q11           0.03647799   4.06E-01
(Intercept)  47.6436925    1.33E-08
quiztotal     0.4686029    3.42E-06
(Intercept)  49.0215606    9.64E-08
midterm       0.3968172    4.26E-05
(Intercept)  45.6114161    0.0004473079
final         0.4469315    0.0045930408

From the results, it is clear that q4, midterm, and quiztotal are the best predictors of the final grade.

Interpretation of results

The following regression models were chosen for their low p-values. 

Model                                          p-value
Grade ~ midterm                                4.26E-05
Grade ~ q4                                     1.17E-04
Grade ~ HWtotal                                0.03586
Grade ~ HWtotal + quiztotal + midterm + final  2.20E-16

quiztotal is replaced with q4, even though quiztotal, with the lowest p-value, has the highest predictive power. The logic is that, since quiztotal is an aggregate of 11 quizzes, a regression against it is really a regression with 11 variables under 10 constraints (the relative coefficients are fixed). We see that the model Grade ~ HWtotal + quiztotal + midterm + final has the best predictive power. However, we only have access to the final exam scores at the very end of the semester, so that model is of little value at the beginning of the semester, when predicting student success can only be based on a few weeks of data.

The p-values signify that the homework is the worst predictor of the final overall grade while q4 is the best. The instructor has known this from experience for some time; the data confirm it statistically. Since the students can use external assistance for their homework assignments, the variable HWtotal has little impact on the final grades. Quiz 4 consists of solving linear equations. Students who are unable to learn the simple techniques of solving a linear equation probably do not have the skills or aptitude to learn the more advanced topics in the class. This implies that the student failure rate can be accurately predicted by week 5.

Let us put that theory to the test by stress-testing the model with a test-train split. Since the dataset is so small, it is really necessary to use bootstrap methods; however, for simplicity, we are going to begin with a simple test-train split of the data. In a test-train split, the data is randomly sorted into a training set and a test set according to a predetermined ratio called the test-train split ratio. The models are then trained on the training set and evaluated on the test set to see which model provides the best fit. We will be fitting linear models only at the beginning, so the bias-variance tradeoff does not factor in yet. Later, however, we will generalize to higher-dimensional models and explore the bias-variance tradeoff.

Linear regression using test-train split

We randomly select 80% of the data as the training sample and reserve the remaining 20% as test data. We then train our linear regression model Grade ~ q4 on the training samples.
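
The creation of trainset is not shown here; a minimal sketch using sample() for the 80/20 split (the seed is arbitrary):

> set.seed(42)
> trainset = sample(nrow(df), round(0.8 * nrow(df)))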

> gradefit = lm(Grade ~ q4, data = df, subset = trainset)
             Estimate  Std. Error  t value  p-value
(Intercept)  70.5059   2.59796     27.139   3.64E-14
q4            0.15557  0.03332      4.669   0.000303

Test the model on the test data set

> predict(gradefit,df)[-trainset]
Student #           3          6          7         17         20
Predicted Grade     86.06334   80.92938   75.63985   75.63985   86.06334
Actual Grade        81         68         74         74         84
Error               -5.06334  -12.92938   -1.63985   -1.63985   -2.06334

The MSE (mean squared error) on the test data set is about 40. Assuming the errors follow a t-distribution, the 95% confidence interval for the mean error can be calculated using the t-distribution; it comes out to (-3.4, +1.3).

> mean(((df$Grade - predict(gradefit,df))[-trainset])^2)
40.48838
> errs = (df$Grade - predict(gradefit,df))
> mean(errs)
-1.060716
> sd(errs)
5.19508
> mean(errs) + qt(0.975, df=16)*sd(errs)/sqrt(22)
1.28728
> mean(errs) - qt(0.975, df=16)*sd(errs)/sqrt(22)
-3.408713


This means that, on average, the model's predictions are within -3.4 and +1.3 points of the actual grade. The model can be written mathematically as follows.

Grade = 0.156*Quiz4grade + 70.5
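
For example, a student scoring 100 on Quiz 4 has a predicted final grade of 0.156 × 100 + 70.5 ≈ 86.1, while a student scoring 0 is predicted to land at 70.5.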

This is remarkable since 70 is the passing grade. We see that Quiz 4 performance directly impacts the pass-fail rate. What we have really accomplished is identifying a predictor of the attrition rate for this class.

Observation

The model trained on available data works really well on the same class of students. This means that if the first 4 weeks of data are available for a group of students, their future performance can be predicted, with 95% confidence, to within about 3.4 points of the actual final grade.

Multiple Linear Regression (MLR)

Read data

> cogn = read.csv("http://bit.ly/dasi_cognitive")
> head(cogn)

kid_score  mom_hs  mom_iq     mom_work  mom_age
 65        yes     121.11753  yes       27
 98        yes      89.36188  yes       25
 85        yes     115.44316  yes       27
 83        yes      99.44964  yes       25
115        yes      92.74571  yes       27
 98        no      107.90184  no        18


Analysis of data

Here we are trying to predict kids' test scores using their mother's IQ, high school degree, work status, and age. Only a few of these predictor variables have a substantial impact on kid_score. As we shall soon see, we can improve the fit (raising the adjusted R-squared) by eliminating a few of these variables. To establish a baseline result we begin by fitting the full model.

> cgn.fit = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age ,
 data = cogn)
> summary(cgn.fit)
Call:
lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, 
    data = cogn)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.045 -12.918   1.992  11.563  49.267 

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  19.59241   9.21906    2.125    0.0341
mom_hsyes     5.09482   2.3145     2.201    0.0282
mom_iq        0.56147   0.06064    9.259    <2e-16
mom_workyes   2.53718   2.35067    1.079    0.281
mom_age       0.21802   0.33074    0.659    0.5101
Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared:  0.2171,	Adjusted R-squared:  0.2098 
F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16

We can see that the variables “mom_workyes” and “mom_age” have high p-values.

We start by fitting simple linear regression models with only one predictor variable. First, create a list of the predictor variables to iterate over.

> cols = colnames(cogn)[!(colnames(cogn) %in% c("kid_score"))]
> cols
[1] "mom_hs"   "mom_iq"   "mom_work" "mom_age" 

Fitting kid_score against each predictor variable in the list ("mom_hs", "mom_iq", "mom_work", "mom_age") we get the following adjusted R-squared values.

> for (c in cols){
+ adjr = summary(lm(paste("kid_score", "~", c), data=cogn))$adj.r.square
+ print(c(c,adjr))
+ }
[1] "mom_hs"             "0.0539445105919029"
[1] "mom_iq"            "0.199101580842152"
[1] "mom_work"            "0.00965521339400432"
[1] "mom_age"             "0.00616844313235732"


The adjusted R-squared values demonstrate that the mother's IQ is the best single predictor of the kid's test scores.

Fitting all possible combinations is a lot of work (see [1]). We would rather use Python to perform those tasks; I will write a separate blog post performing the same analysis with Python.

We can, however, analyze a few of the models manually, performing MLR while removing one predictor variable at a time [2].
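
For instance, a minimal sketch of one such step: drop mom_age (the highest p-value in the full model above), refit, and compare the adjusted R-squared.

> cgn.fit2 = lm(kid_score ~ mom_hs + mom_iq + mom_work, data = cogn)
> summary(cgn.fit2)$adj.r.squared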


References:

  1. ryouready.wordpress.com/2009/02/06/r-calculating-all-possible-linear-regression-models-for-a-given-set-of-predictors
  2. www.coursera.org/learn/linear-regression-model


Timestamp error

TypeError: an integer is required (got type Timestamp)


This error is common when a date is extracted from a pandas dataframe datetime column or index. We need to extract the date from the Timestamp object by using the .date() method:

startdate
Timestamp('2019-10-28 05:00:00')

startdate.date()
datetime.date(2019, 10, 28)

If we need to convert it back to a datetime object, then we need to combine the date and the time portions like this

startdate = dt.datetime.combine(startdate.date(), startdate.time())


dt.datetime.combine() combines the date part and the time part to form a datetime object.
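
Continuing with the startdate above (and assuming the datetime module was imported as dt), the repacked value looks like this:

startdate = dt.datetime.combine(startdate.date(), startdate.time())
startdate
datetime.datetime(2019, 10, 28, 5, 0)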