Convert column to a float from str dropping non-numeric strings

June 14, 2019 ~ Saugata

Let us say we have the following dataframe

df['Amount $']

0.07
1.154
2.596
X-Links
Amount $
0.102

We want to convert all numbers to float and drop the non-numeric rows. The string method isnumeric() will not work here: every value is a str, and isnumeric() returns False for any string that contains a decimal point, so "0.07" would be rejected along with the genuinely non-numeric rows. One option is to write a small function that tries to convert a string to a float and returns False if the conversion fails. Applying this function to the entire column returns a boolean Series in which True means the value can be converted to a float and False means it cannot. Used as a boolean mask on the dataframe, this Series filters out the non-numeric rows.

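def tryfloat(f):
    # True if f can be converted to a float, False otherwise
    try:
        float(f)
        return True
    except (ValueError, TypeError):
        return False

# keep only the rows whose 'Amount $' value is numeric
df = df[ df['Amount $'].apply(tryfloat) ].copy()
# the surviving values are still str, so convert the column to float
df['Amount $'] = df['Amount $'].astype(float)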

The result is this table:

0.07
1.154
2.596
0.102
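As an alternative to the mask approach, pandas also provides pd.to_numeric(), which can convert the column and flag the non-numeric entries in one step. A minimal sketch, assuming a small hand-made dataframe that mimics the column above (the sample values are copied from the example; the variable names amounts and clean are only illustrative):

```python
import pandas as pd

# Sample column mimicking the example above; every value is a str.
df = pd.DataFrame(
    {'Amount $': ['0.07', '1.154', '2.596', 'X-Links', 'Amount $', '0.102']}
)

# errors='coerce' turns anything that cannot be parsed into NaN.
amounts = pd.to_numeric(df['Amount $'], errors='coerce')

# Put the converted values back and drop the NaN rows;
# what remains is a float64 column.
clean = df.assign(**{'Amount $': amounts}).dropna(subset=['Amount $'])
print(clean)
```

Whether this is preferable to tryfloat() depends on the data: to_numeric() is shorter, while the explicit function makes it easy to add custom rules later.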

Barplots in R

May 6, 2019 ~ Saugata

Basic barplot

To draw the basic barplots we use the in-built data table mtcars.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If we wish to plot the frequency of gears found in the database we first need to create a frequency table by aggregating all the unique values of the “gear” column. table() helps aggregate unique values and their frequencies.

> tmp=table(mtcars$gear)
> tmp

 3  4  5
15 12  5
> barplot(tmp)

Grouped barplot

The grouped barplot takes more effort.

> tmp=table(mtcars$cyl, mtcars$gear)
> tmp

      3  4  5
  4   1  8  2
  6   2  4  1
  8  12  0  2

> barplot(tmp,beside = T, legend = paste(rownames(tmp),"cylinders"), 
  col = c(1:length(rownames(tmp))), xlab = "Gears/cylinders", 
  ylab = "Count")

rownames(tmp) extracts the unique values of the column “cyl” (cylinders). col assigns one color per row, using the color numbers 1 to length(rownames(tmp)). legend takes rownames(tmp) as its argument, but we append the string “cylinders” so that it is clear what the numbers signify. That is achieved by the paste() command, which “pastes” the string “cylinders” at the end of each row name.

References:
https://www.statmethods.net/graphs/bar.html

Plotting multiple graphs in R

April 24, 2019 ~ Saugata

There are two ways to plot multiple graphs in R:

Create a grid of separate graphs on the same plot
Overlay all graphs in a single frame of one plot

Create a grid of separate graphs on the same plot

A grid in R is created using the par() command. Here is an example where we plot 4 graphs in a single plot.

> x=seq(-4,4,.1)
> par(mfrow=c(2,2))
> plot(x,dt(x,3),'l',col='red')
> plot(x,dt(x,5),'l',col='blue')
> plot(x,dt(x,10),'l',col='black')
> plot(x,dnorm(x),'l',col='magenta')

We generate a vector ‘x’ of values from -4 to +4 in steps of 0.1 and feed it into the probability density function (pdf) of Student's t-distribution, dt(x, df), which accepts the values at which to evaluate the density and the degrees of freedom (df). We plot 3 graphs with df = 3, 5, 10, and then we plot the pdf of the Normal Distribution, dnorm(x). Student's t-distribution makes more sense with smaller sample sizes and probably with stock prices, which tend to have extreme events.

The 2×2 layout (4 graphs in 2 rows and 2 columns) is made possible by the command par(mfrow=c(2,2)). For a 2×1 matrix of plots, we would have used par(mfrow=c(2,1)).

[Figure: Rplot2, a 2×2 grid of the four density plots]

The graphs are all different, since Student's t-distribution has fatter tails than the Normal Distribution. However, that is not very apparent from the side-by-side graphs. We will need to overlay the graphs on top of each other to see the small differences between them.

To return to the regular single-plot layout, use

> par(mfrow=c(1,1))

Overlay all graphs in a single frame of one plot

To overlay multiple plots on each other, we start by plotting the first graph with all its axes and labels. The only restriction is that we need to fix the y-interval to specific values so that every overlaid curve is drawn on the same scale (this assumes x is the independent variable; if y is the independent variable, restrict the x-interval instead). ylim=c(0,0.4) restricts the plot to the y-interval (0,0.4).

> par(mfrow=c(1,1))
> plot(x,dnorm(x),'l',col=1,ylim=c(0,0.4))

[Figure: Rplot1, the normal pdf plotted with ylim=c(0,0.4)]

Now we overlay the rest of the plots on this one by looping over the pdf of Student's t-distribution with df = 3, 5, 10 (which we plotted in the previous section).

> for (val in c(3,5,10)){
+     par(new=T)
+     plot(x,dt(x,val),'l',col=val,ylim=c(0,0.4),axes=F,xlab="",ylab="")
+ }

The overlay is achieved using the par(new=T) command (same as par(new=TRUE)). It tells R to draw the next plot on top of the existing one instead of clearing the frame first. We also need to turn off the axes (axes=F) and the labels for the x-axis and the y-axis (xlab="", ylab=""). Otherwise, the text of the different graphs will overlap and look messy. In the resulting graph, we can see the fat tails of Student's t-distribution.

[Figure: Rplot4, the normal pdf with the three t-distribution pdfs overlaid]

Dataframe manipulation with pandas

April 17, 2019 ~ Saugata

Merge databases

import pandas as pd

db1 = pd.DataFrame({'Name':['Jones','Will','Rory','Bryce','Hok'],
                    'job_id':[2,5,3,7,2]}, index=[1,2,3,4,5])

db2 = pd.DataFrame({'Name':['CEO','Chairman','Vice-Chairman','Senior Engineer'],
                    'job_id':[5,1,2,3]}, index=[1,2,3,4])

df = pd.merge(db1,db2,on='job_id')

	Name_x	job_id	Name_y
0	Jones	2	Vice-Chairman
1	Hok	2	Vice-Chairman
2	Will	5	CEO
3	Rory	3	Senior Engineer

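merge() performs an inner join by default (how='inner'), so rows whose job_id has no match in the other dataframe are dropped: Bryce (job_id 7) and Chairman (job_id 1) do not appear in the result. Because both dataframes have a column called Name, merge() renames them Name_x and Name_y.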
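If the unmatched rows should be kept, the join type can be changed with the how parameter, and the _x/_y endings can be replaced with the suffixes parameter. A minimal sketch using the same two dataframes (the suffix names _person and _title are my own choice, not part of the original example):

```python
import pandas as pd

db1 = pd.DataFrame({'Name': ['Jones', 'Will', 'Rory', 'Bryce', 'Hok'],
                    'job_id': [2, 5, 3, 7, 2]})
db2 = pd.DataFrame({'Name': ['CEO', 'Chairman', 'Vice-Chairman', 'Senior Engineer'],
                    'job_id': [5, 1, 2, 3]})

# how='left' keeps every row of db1; a job_id without a match in db2
# gets NaN in the columns that come from db2.
# suffixes replaces the default _x / _y endings of the clashing Name columns.
left = pd.merge(db1, db2, on='job_id', how='left',
                suffixes=('_person', '_title'))
print(left)
```

Here Bryce is retained with a missing Name_title, which makes it easy to spot job_id values that have no entry in db2.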

https://pandas.pydata.org/pandas-docs/stable/merging.html

Extracting rows from a dataframe by row number using iloc

>>> df.iloc[2]
Name_x    Will
job_id       5
Name_y     CEO
Name: 2, dtype: object

Extracting rows which match a string value

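Syntax: df[ ( df['col'] == "some value" ) ]

(hpi['RegionName'] == "Mesa") generates a Boolean Series with one True/False entry per row, which can then be used as a mask to extract the rows of hpi for which it is True. The ( ) are optional for a single condition, but they become necessary when several conditions are combined with & or |, because those operators bind more tightly than ==.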

# select all rows where the RegionName is "Mesa"
mesadataall = hpi[ (hpi['RegionName'] == "Mesa")  ]
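To illustrate when the parentheses matter, here is a small sketch that combines two conditions. The hpi dataframe below is a hypothetical stand-in: only the RegionName column comes from the text, while the State column and all values are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for hpi; only 'RegionName' appears in the original.
hpi = pd.DataFrame({'RegionName': ['Mesa', 'Tempe', 'Mesa'],
                    'State': ['AZ', 'AZ', 'CO']})

# A single comparison yields a boolean Series, one entry per row.
mask = hpi['RegionName'] == 'Mesa'
print(mask.tolist())

# & binds more tightly than ==, so each comparison needs its own parentheses.
mesa_az = hpi[(hpi['RegionName'] == 'Mesa') & (hpi['State'] == 'AZ')]
print(mesa_az)
```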

Cleaning databases using replace()

# clean data with sed like REGEX
# remove all (2014) references
moddata.replace(r" \(.*\)", "", inplace=True, regex=True) 
# replace the word unavailable by 0 
moddata.replace(r"unavailable", "0", inplace=True, regex=True)

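The first regex removes any parenthesised text together with the space before it, such as " (2014)". The second replaces the word "unavailable" with "0", so that the cells contain only numeric text.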
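The effect of the two patterns can be checked on plain strings with Python's re module before running them on a whole dataframe. A minimal sketch, with made-up sample values:

```python
import re

# Made-up cell values for illustration.
samples = ["12.5 (2014)", "unavailable", "7 (est.) 8 (rev.)"]

cleaned = []
for s in samples:
    s = re.sub(r" \(.*\)", "", s)       # drop " (...)" and everything inside it
    s = re.sub(r"unavailable", "0", s)  # replace the placeholder word
    cleaned.append(s)

print(cleaned)

# .* is greedy: it runs from the first "(" to the last ")" in the string.
# A non-greedy .*? removes each parenthesised group separately.
lazy = re.sub(r" \(.*?\)", "", samples[2])
print(lazy)
```

This prints ['12.5', '0', '7'] and then 7 8. The third sample shows that the greedy pattern swallows everything between the first opening and the last closing parenthesis, which is worth keeping in mind if a cell can contain more than one parenthesised note.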

Web scrape tables from website using pandas

data = pd.read_html(
'https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate')
# read_html returns a list of dataframes, one per table found in the HTML
# the table we want is at index 4 of the list (the fifth element)

df = data[4]

to be continued …

Jupyter notebook running the wrong python version

March 31, 2019 ~ Saugata

When multiple versions of python are installed in a system along with anaconda3, jupyter kernels might run the wrong python version. Here is an example.

When we start the Python 2 kernel explicitly from the drop-down menu, we expect Jupyter to be running Python 2. But that is not the case as verified below.
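One way to check which interpreter a kernel is actually running is to ask Python itself from a notebook cell. A minimal sketch using only the standard library; it should work unchanged under both Python 2 and Python 3:

```python
import sys

# Full path of the interpreter that is executing this cell.
print(sys.executable)

# Version string of that interpreter.
print(sys.version)
```

If the notebook was started with the Python 2 kernel but the version string starts with 3, the kernel is launching the wrong interpreter.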

The culprit is the kernel.json file in the jupyter kernel folder at /usr/share/jupyter/kernels/python2.

The kernel.json file asks jupyter to run /usr/bin/python. But /usr/bin/python points to python3 and not python2. Therefore, jupyter ends up running python 3. We need to replace python with python2 in the kernel.json file (in /usr/share/jupyter/kernels/python2).

{
"argv": [
"/usr/bin/python2", 
"-m", 
"ipykernel_launcher", 
"-f", 
"{connection_file}"
], 
"display_name": "Python 2", 
"language": "python"
}

Save the file and restart jupyter.
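To see which interpreter a kernel spec will launch without opening the file by hand, the argv entry can be read with the json module. A minimal sketch; kernel_interpreter is a hypothetical helper name, and the path is the one mentioned above, which may differ on other systems:

```python
import json

def kernel_interpreter(kernel_json_path):
    """Return the first argv entry (the program to launch) of a kernel spec."""
    with open(kernel_json_path) as fh:
        spec = json.load(fh)
    return spec["argv"][0]

# Path taken from the text above; adjust it for your system.
path = "/usr/share/jupyter/kernels/python2/kernel.json"
try:
    print(kernel_interpreter(path))
except (OSError, ValueError, KeyError, IndexError) as exc:
    # Missing file, invalid JSON, or no argv entry.
    print("could not read", path, "-", exc)
```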

I had to repeat a similar procedure with the Sagemath Jupyter kernel. /usr/share/jupyter/kernels/sagemath had the following kernel declaration for jupyter that was causing the kernel to crash.

{
"display_name": "SageMath 8.1", 
"argv": [
 "/usr/bin/sage",
 "--python",
 "-m",
 "sage.repl.ipython_kernel",
 "-f",
 "{connection_file}"
]
}

When --python was replaced with --python2, it started working.

{
"display_name": "SageMath 8.1", 
"argv": [
 "/usr/bin/sage",
 "--python2",
 "-m",
 "sage.repl.ipython_kernel",
 "-f",
 "{connection_file}"
]
}

For Sagemath installed from source this will not happen. However, if Sagemath was installed from the repositories (sudo apt install sagemath sagemath-common), then this error is inevitable because the packaged sagemath depends on python 2 (see my article https://bytesofcomputerwisdom.home.blog/2019/03/23/sagemathwont-run-no-module-named-sage-repl/).

How to search python interpreter history

March 31, 2019 ~ Saugata

To search the python interpreter history, we can use the following code as it is, with the string search_string replaced by the string to be searched for.

>>> import readline
>>> for i in range(readline.get_current_history_length()):
...     x = readline.get_history_item(i + 1)
...     if "search_string" in x:
...         print(x)
...

Note: The indentation of the code is crucial to its function.

Draw filled plots in python

March 26, 2019 ~ Saugata

Matplotlib's fill() function can fill a polygon given the coordinates of its vertices. See the first example in the following link.

https://matplotlib.org/gallery/lines_bars_and_markers/fill.html

This method can be extended to draw filled curves or to shade the area under a curve. For example, to shade the area under a normal distribution up to the z-score -1.4, we first generate a list of (x,y) coordinate pairs along the curve of the distribution from x=-4 to x=-1.4 using y = Exp(-x^2/2)/√(2π).

The y-values are generated using the scipy function norm.pdf(), which returns the value of the Gaussian Exp(-x^2/2)/√(2π). This gives (x,y) pairs along the curve of the Gaussian. However, we also need to add the (-1.4,0) coordinate so that the polygon closes along the x-axis.

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

interval = 0.1

x = np.arange(-4, 4, interval)
y = norm.pdf(x)

# z-score
z = -1.4

# we want to shade from -inf to z or rather the interval [x_min, z]
# generate x-coordinates: [x_min, z]; linspace includes the endpoint z,
# whereas arange may or may not, depending on floating-point rounding
xarea = np.linspace(-4, z, 27)
# generate y-coordinates
yarea = norm.pdf(xarea)
# need to add the point (x,y): (z, 0) so that
# polygon fills all the way to the x-axis
# append x=z-score
xarea = np.append(xarea, z)
# append y=0
yarea = np.append(yarea, 0)

plt.plot(x, y)

plt.fill(xarea, yarea)

plt.text(-1.4 , 0 , '-1.4' , {'color':'r'})

plt.show()

Maximize plots from within code

# show plot in a maximized window
mng = plt.get_current_fig_manager()
mng.full_screen_toggle()

References
https://matplotlib.org/gallery/text_labels_and_annotations/usetex_demo.html#sphx-glr-gallery-text-labels-and-annotations-usetex-demo-py

Plotting columns of a dataframe in python

March 26, 2019 ~ Saugata

Every column of a dataframe can be plotted provided it contains numeric values. To demonstrate this, we will generate a dataframe from S&P data pulled from Yahoo finance using pandas_datareader. Then we plot one of the columns of the dataframe: the column labeled 'Adj Close' (adjusted close). We will use matplotlib's pyplot to do the plotting, but seaborn can also be used.

from matplotlib import pyplot as plt
import datetime
# pandas_datareader imports stock data from yahoo finance 
from pandas_datareader import data as web

start = datetime.datetime(2018,1,5)
end = datetime.datetime(2018,2,5)
# get some data from yahoo finance
spdrdata = web.DataReader('SPY',"yahoo",start,end)
# plot the column of the dataframe called 'Adj Close'
spdrdata['Adj Close'].plot()
# display with pyplot
plt.show()

spdrdata['Adj Close'].plot() selects the column and directly plots it. The plot() function automatically uses the index of the dataframe as the x-axis. The index of this particular dataframe is the dates.

>>> spdrdata.index
DatetimeIndex(['2018-01-05', '2018-01-08', '2018-01-09', '2018-01-10',
               '2018-01-11', '2018-01-12', '2018-01-16', '2018-01-17',
               '2018-01-18', '2018-01-19', '2018-01-22', '2018-01-23',
               '2018-01-24', '2018-01-25', '2018-01-26', '2018-01-29',
               '2018-01-30', '2018-01-31', '2018-02-01', '2018-02-02',
               '2018-02-05', '2018-02-06'],
              dtype='datetime64[ns]', name='Date', freq=None)

[Figure: Figure_1, plot of the 'Adj Close' column against the date index]

Sagemath won’t run in Linux (No module named ‘sage.repl’)

March 23, 2019 / March 31, 2019 ~ Saugata

TLDR: Change #!/usr/bin/env python to #!/usr/bin/env python2 in the file /usr/share/sagemath/bin/sage-ipython

This error happens because the Python environment where Sage is running is set up to use a Python version other than Python 2.7.

If sagemath is installed from the Ubuntu repository (sudo apt-get install sagemath), it is installed under python2.7. We can verify this from the Ubuntu repo

https://packages.ubuntu.com/bionic/amd64/sagemath/filelist

Or, if we have already installed sagemath, we can verify it by checking that /usr/lib/python2.7/dist-packages/sage exists. So, when the default python is not python2.7, trying to run sage from a terminal will only give an error.

$ sage
Traceback (most recent call last):
  File "/usr/share/sagemath/bin/sage-ipython", line 7, in <module>
    from sage.repl.interpreter import SageTerminalApp
ModuleNotFoundError: No module named 'sage.repl'

This is because sage is trying to run under a python version different than python2.7. We can verify this is the case.

$ which python
/usr/bin/python
$ ls -l /usr/bin/python
lrwxrwxrwx 1 root root 16 Mar 23 12:14 /usr/bin/python -> /usr/bin/python3

So the python environment is python3.6 and not python 2.7 (as required by sage). Sage doesn’t automatically select the right python version.

root@parton:/usr/share/sagemath/bin# cat sage-ipython 
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Sage IPython startup script.
"""

from sage.repl.interpreter import SageTerminalApp

app = SageTerminalApp.instance()
app.initialize()
app.start()

So sage-ipython is launched through env, which runs the first python found on $PATH. Here that is /usr/bin/python, which points to python 3.6. We can see what the current environment will select.

$ type -a python
python is /usr/bin/python                   (<- python 3.6)
python is /home/jones/anaconda3/bin/python  (<- python 3.5)
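The same lookup can be reproduced from Python, which is handy when checking several machines. A minimal sketch using only the standard library; shutil.which() searches $PATH in the same order the shell does, and os.path.realpath() follows symlinks such as /usr/bin/python -> /usr/bin/python3:

```python
import os
import shutil

# First "python" found on $PATH, or None if there is none.
resolved = shutil.which("python")
print("python resolves to:", resolved)

if resolved is not None:
    # Follow any symlinks to the real interpreter file.
    print("real file:", os.path.realpath(resolved))
```

On a system like the one described here, the real file would be a python3 binary, which confirms that the #!/usr/bin/env python line cannot find python2.7.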

For users running python3.5 with anaconda3 but having python2.7 installed system-wide, temporarily renaming anaconda3 or changing the $PATH variable to move anaconda3 to the end seems to work. I had no success with this, and the reason is clear. When anaconda3 is removed from $PATH, the OS python takes over, which is python 3.6. So unless the OS python is also python2.7, removing anaconda3 will not solve the problem of sagemath not running.

There are two options available.

Create a new virtual environment with virtualenv and install python2.7 and sagemath in that environment. This turns out to be too much work and takes up a lot of space.
Modify the sage-ipython file to use python2.7

We will modify /usr/share/sagemath/bin/sage-ipython

$ su
$ cd /usr/bin/
$ ls -l python*

Check for a link that points to python2.7

lrwxrwxrwx 1 root root 9 Apr 16 2018 python2 -> python2.7

If there is none pointing to python2.7 then create a link

$ ln -s /usr/bin/python2.7 python2

Now let’s modify the sage-ipython file

$ cd /usr/share/sagemath/bin
$ gedit sage-ipython

Change #!/usr/bin/env python to #!/usr/bin/env python2

Save, log out of root, and run sage as a normal user. Sage should work now.

Sagemath jupyter server is crashing

Sagemath Jupyter GUI server crash is fixed by editing
/usr/share/jupyter/kernels/sagemath/kernel.json. See the post Jupyter notebook running the wrong python version.

/usr/share/jupyter/kernels/sagemath had the following kernel declaration for jupyter that was causing the kernel to crash.

{
"display_name": "SageMath 8.1", 
"argv": [
 "/usr/bin/sage",
 "--python",
 "-m",
 "sage.repl.ipython_kernel",
 "-f",
 "{connection_file}"
]
}

When --python was replaced with --python2, it started working.

{
"display_name": "SageMath 8.1", 
"argv": [
 "/usr/bin/sage",
 "--python2",
 "-m",
 "sage.repl.ipython_kernel",
 "-f",
 "{connection_file}"
]
}

Convert all data in a table to a numeric data type in R

March 10, 2019 ~ Saugata

When tables are imported from CSV files, they are usually imported as data.frames, and the values may be stored as characters rather than numbers. Arithmetic cannot be performed on the table until the data is converted to a numeric data type.

The conversion of a single variable to the numeric data type in R involves passing the variable to the function as.numeric().

var_in_numeric_dtype = as.numeric(var_in_char_dtype)

For tables (matrices), the as.numeric() function has to be applied to every element using the apply() function; the margin c(1,2) tells apply() to work over both rows and columns, i.e. element by element (I seem to have more control with apply() than sapply()).

> alg2 = apply(alg[,c(1:4)],c(1,2),as.numeric)
            Mean grade   Std Total students # of fails
X2016S1          77.00 14.00              5          1
X2016S2          74.00 14.00             11          3
X2016S3          85.00 12.00             20          1
X2017S1          72.00 21.00             22          5
X2017S2          57.45 38.28              7          3
X2018S1          73.91 21.56             17          3
X2018S2          83.20  6.62              9          0
X2018S3          69.98 22.44             14          4
Spring.2019      69.63 28.62             19          2

alg2 is a matrix consisting of numerical values on which we can perform mathematical operations. We can now recalculate the column “fail rate” (which we had to drop as it was not numeric) using columns 4 (“# of fails”) and 3 (“Total students”).

> failrate = alg2[,4]/alg2[,3]*100
    X2016S1     X2016S2     X2016S3     X2017S1     X2017S2     X2018S1     X2018S2     X2018S3 Spring.2019 
   20.00000    27.27273     5.00000    22.72727    42.85714    17.64706     0.00000    28.57143    10.52632 


> alg = cbind(alg2, failrate)
            Mean grade   Std Total students # of fails failrate
X2016S1          77.00 14.00              5          1 20.00000
X2016S2          74.00 14.00             11          3 27.27273
X2016S3          85.00 12.00             20          1  5.00000
X2017S1          72.00 21.00             22          5 22.72727
X2017S2          57.45 38.28              7          3 42.85714
X2018S1          73.91 21.56             17          3 17.64706
X2018S2          83.20  6.62              9          0  0.00000
X2018S3          69.98 22.44             14          4 28.57143
Spring.2019      69.63 28.62             19          2 10.52632