Auto justify section headers with custom padding text

Let’s say we have a text file that we want to organize into sections with headers. The headers should look well indented and visually separate the sections. An example might look like this.

---========= DISPLAY CONTENTS OF DATABASES AND TABLES =========---

mysql> show tables;

---======================= MODIFY DATA ========================---

mysql> ALTER TABLE books MODIFY title VARCHAR(50) AFTER book_id;
mysql> UPDATE books SET genre=3 WHERE book_id=1;

Typing the extra = signs and centering the text by hand is a laborious process. A short python script can center the text and generate the header.

def f(a):
    print("---"+a.rjust(30+int(len(a)/2),"=").ljust(60,"=")+"---")

f(" SORT ROWS ")
---======================== SORT ROWS =========================---

The function accepts the header string. It first right-justifies the string and pads (fills) the extra space in front with =. Then it left-justifies and pads the extra space at the end with =.

a.rjust(30+int(len(a)/2),"=")

does the main work. If the string has length 11 and we want the header to be 60 characters wide, the string’s midpoint should sit at position 30. Setting the rjust() width to 30 + len(a)//2 = 35 makes rjust() pad the front with 35 − 11 = 24 = symbols, which centers the string. The left justify is simpler, since we know the whole header should have length 60: ljust(60) pads the remaining 25 = symbols at the end.

ljust(60,"=")
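
As an aside, Python’s built-in str.center() performs both steps in one call; for this input it produces the same header as f() above (a minimal sketch):

```python
def f2(a):
    # center the string in a 60-character field, filling with "="
    print("---" + a.center(60, "=") + "---")

f2(" SORT ROWS ")
# prints the same header as f(" SORT ROWS ") above
```

Note that for an odd amount of total padding, both approaches put the extra = at the end.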

Method overloading in python with no default parameters

Since default parameter values are evaluated once, at method definition time, and not at run time, they cannot be set dynamically (for example, from instance attributes). It also means we cannot overload a function by its signature like we can in C++, such as this

int func();
int func(int c);

However, the workaround is to give the parameters a default value of None, which allows the method to be called without any arguments. For example, the method

  def getdata(self, startdate, enddate):

when called as

s = obj.getdata()

will raise an exception

TypeError: getdata() missing 2 required positional arguments: 'startdate' and 'enddate'

We need to modify the declaration to

  def getdata(self, startdate=None, enddate=None):
       if startdate is None:
           startdate = self.stockdata.index[0]
       if enddate is None:
           enddate = self.stockdata.index[-1]

This defines a method that can be called with zero, one, or two arguments. When an argument is omitted, the if statements fill in the value dynamically at runtime.
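
A self-contained sketch of the pattern (the StockData class and its index values are made up for illustration):

```python
class StockData:
    def __init__(self):
        # hypothetical ordered index of dates
        self.index = ["2018-01-05", "2018-01-08", "2018-02-05"]

    def getdata(self, startdate=None, enddate=None):
        # defaults are injected at call time, not at definition time
        if startdate is None:
            startdate = self.index[0]
        if enddate is None:
            enddate = self.index[-1]
        return startdate, enddate

obj = StockData()
print(obj.getdata())              # ('2018-01-05', '2018-02-05')
print(obj.getdata("2018-01-08"))  # ('2018-01-08', '2018-02-05')
```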

Reference:
Using self.xxxx as a default parameter in a class method


Convert column to a float from str dropping non-numeric strings

Let us say we have the following dataframe

df['Amount $']

0        0.07
1       1.154
2       2.596
3     X-Links
4    Amount $
5       0.102

And we want to convert all numbers to float and drop the non-numeric rows. isnumeric() will not work since the data is all of str dtype and strings like "0.07" contain a decimal point. One option is to write a small function which tries to convert a string to a float and returns False if the conversion fails. If this function is mapped to the entire column it returns a boolean Series, where True means float and False means non-float. Used as a boolean mask on the dataframe, it filters out the non-numeric rows.

def tryfloat(f):
    try:
        float(f)
        return True
    except ValueError:
        return False

df[ df['Amount $'].apply(lambda x: tryfloat(x)) ]


The result is this table

0     0.07
1    1.154
2    2.596
5    0.102
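
Alternatively, pandas’ built-in pd.to_numeric with errors='coerce' turns non-numeric strings into NaN, which can then be dropped. A sketch on a toy version of the column:

```python
import pandas as pd

df = pd.DataFrame({'Amount $': ['0.07', '1.154', '2.596',
                                'X-Links', 'Amount $', '0.102']})
# non-numeric strings become NaN and are then dropped
amounts = pd.to_numeric(df['Amount $'], errors='coerce').dropna()
print(amounts.tolist())  # → [0.07, 1.154, 2.596, 0.102]
```

This also converts the surviving values to float in the same step, so no separate conversion is needed.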

Barplots in R

Basic barplot

To draw a basic barplot we use the built-in dataset mtcars.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If we wish to plot the frequency of gears found in the dataset, we first need to create a frequency table by aggregating all the unique values of the "gear" column. table() aggregates unique values and their frequencies.

> tmp=table(mtcars$gear)
> tmp

 3  4  5 
15 12  5 
> barplot(tmp)

Grouped barplot

The grouped barplot takes more effort.

> tmp=table(mtcars$cyl, mtcars$gear)
> tmp

     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2

> barplot(tmp,beside = T, legend = paste(rownames(tmp),"cylinders"), 
  col = c(1:length(rownames(tmp))), xlab = "Gears/cylinders", 
  ylab = "Count")

rownames(tmp) extracts the unique values of the "cyl" (cylinders) column. col cycles through colors. The legend takes rownames(tmp) as an argument, but we want to append the string "cylinders" so that it is clear what the numbers signify. That is achieved by the paste() command, which "pastes" the string "cylinders" after each row name.
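
For comparison, and assuming pandas and matplotlib are available, the same grouped bar chart can be sketched in Python. The small cars frame below is made up, standing in for mtcars; pd.crosstab plays the role of R’s table():

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import pandas as pd

# made-up stand-in for a few rows of mtcars
cars = pd.DataFrame({'cyl':  [4, 4, 6, 6, 8, 8, 8],
                     'gear': [4, 5, 3, 4, 3, 3, 5]})

# crosstab plays the role of R's table(cyl, gear)
tmp = pd.crosstab(cars['cyl'], cars['gear'])
# transpose so gears are on the x-axis, like barplot(tmp, beside=T)
ax = tmp.T.plot.bar()
ax.set_xlabel("Gears/cylinders")
ax.set_ylabel("Count")
```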

References:
https://www.statmethods.net/graphs/bar.html

Plotting multiple graphs in R

There are two ways to plot multiple graphs in R.

  1. Create a grid of separate graphs on the same plot
  2. Overlay all graphs in a single frame of one plot

Create a grid of separate graphs on the same plot

A grid in R is created using the par() command. Here is an example where we plot 4 graphs in a single plot.

> x=seq(-4,4,.1)
> par(mfrow=c(2,2))
> plot(x,dt(x,3),'l',col='red')
> plot(x,dt(x,5),'l',col='blue')
> plot(x,dt(x,10),'l',col='black')
> plot(x,dnorm(x),'l',col='magenta')

We generate a vector of values between (-4, +4) called ‘x’ and then feed it into the probability density function (pdf) of Student’s t-distribution – dt(z, df) – which accepts the z-score and the degrees of freedom (df). We plot 3 graphs with df = 3, 5, 10, and then we plot the pdf of the Normal Distribution – dnorm(z). Student’s t-distribution makes more sense with smaller sample sizes, and probably with stock prices, which tend to have extreme events.

The 2×2 grid – 4 graphs in 2 rows – is made possible by the command par(mfrow=c(2,2)). For a 2×1 matrix of plots, we would have used par(mfrow=c(2,1)).

Rplot2

The graphs are all slightly different, since Student’s t-distribution has fatter tails than the Normal Distribution. However, that is not very apparent from the side-by-side graphs. We will need to overlay the graphs on top of each other to see the small differences.

To return to the regular single-plot grid, use

> par(mfrow=c(1,1))

Overlay all graphs in a single frame of one plot

To overlay multiple plots on each other we start by plotting the first graph with all its axes and labels. The only restriction is that we need to fix the y-interval to specific values (if x is the independent variable; if y is the independent variable we would restrict the x-interval instead). ylim=c(0,0.4) restricts the plot to the y-interval (0, 0.4).

> par(mfrow=c(1,1))
> plot(x,dnorm(x),'l',col=1,ylim=c(0,0.4))

Rplot1

Now, we overlay the rest of the plots over this one by looping over the pdf of Student’s t-distribution with df = 3, 5, 10 (which we plotted in the previous section).

> for (val in c(3,5,10)){
+     par(new=T)
+     plot(x,dt(x,val),'l',col=val,axes=F,xlab="",ylab="")
+ }

The overlay is achieved using the par(new=T) command (same as par(new=TRUE)). It tells R that we are plotting a new graph in the same space. We also need to turn off the axes (axes=F) and the labels for the x-axis and the y-axis (xlab="", ylab=""). Otherwise, the text of the different graphs would overlap and look messy. In the resulting graph (Rplot4), we can see the fat tails of Student’s t-distribution.

Dataframe manipulation with pandas

Merge dataframes

db1 = pd.DataFrame({'Name':['Jones','Will','Rory','Bryce','Hok'],
 'job_id':[2,5,3,7,2]}, index=[1,2,3,4,5])



db2 = pd.DataFrame({'Name':['CEO','Chairman','Vice-Chairman',
'Senior Engineer'], 'job_id':[5,1,2,3]}, index=[1,2,3,4])

df = pd.merge(db1,db2,on='job_id')
>>> df
  Name_x  job_id           Name_y
0  Jones       2    Vice-Chairman
1    Hok       2    Vice-Chairman
2   Will       5              CEO
3   Rory       3  Senior Engineer

merge() defaults to an inner join: rows whose job_id has no match in the other frame (Bryce, Chairman) are dropped, and the clashing Name columns are renamed to Name_x and Name_y.

https://pandas.pydata.org/pandas-docs/stable/merging.html
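
To keep the unmatched rows instead, merge() accepts how='outer'. A sketch using the same frames as above (unmatched fields become NaN):

```python
import pandas as pd

db1 = pd.DataFrame({'Name': ['Jones', 'Will', 'Rory', 'Bryce', 'Hok'],
                    'job_id': [2, 5, 3, 7, 2]})
db2 = pd.DataFrame({'Name': ['CEO', 'Chairman', 'Vice-Chairman',
                             'Senior Engineer'], 'job_id': [5, 1, 2, 3]})

# how='outer' keeps Bryce (job_id 7) and Chairman (job_id 1),
# with NaN in the columns that have no match
df = pd.merge(db1, db2, on='job_id', how='outer')
print(df)
```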

Extracting rows from a dataframe by row number using iloc

>>> df.iloc[2]
Name_x    Will
job_id       5
Name_y     CEO
Name: 2, dtype: object
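
iloc also accepts slices and lists of positions; a quick sketch on a toy frame (made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jones', 'Hok', 'Will', 'Rory'],
                   'job_id': [2, 2, 5, 3]})

print(df.iloc[1:3])     # rows at positions 1 and 2
print(df.iloc[[0, 3]])  # rows at positions 0 and 3
```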

Extracting rows which match a string value

Syntax: df[ ( df['col'] == "some value" ) ]

(hpi['RegionName'] == "Mesa") generates a Boolean Series which can then be used to extract the rows of hpi[] that are True. The parentheses are not strictly required for a single condition, but they are crucial when combining conditions with & or |.

# select all rows where the RegionName is "Mesa"
mesadataall = hpi[ (hpi['RegionName'] == "Mesa")  ]

Cleaning databases using replace()

# clean data with sed like REGEX
# remove all (2014) references
moddata.replace(r" \(.*\)", "", inplace=True, regex=True) 
# replace the word unavailable by 0 
moddata.replace(r"unavailable", "0", inplace=True, regex=True) 


These regexes clean the data by stripping the parenthesized references and replacing the word unavailable with 0, leaving numeric strings behind.
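
A self-contained demonstration on a toy dataframe (the column name and values are made up):

```python
import pandas as pd

moddata = pd.DataFrame({'rate': ['12.3 (2014)', 'unavailable', '4.5 (2016)']})

# remove all (2014)-style references
moddata.replace(r" \(.*\)", "", inplace=True, regex=True)
# replace the word unavailable by 0
moddata.replace(r"unavailable", "0", inplace=True, regex=True)

print(moddata['rate'].tolist())  # → ['12.3', '0', '4.5']
```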

Web scrape tables from website using pandas

data = pd.read_html(
'https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate')
# every HTML table on the page is imported into a list of dataframes
# the table we want happens to be at index 4 of the list
# (the position can change as the page is edited)

df = data[4]


to be continued …

Jupyter notebook running the wrong python version

When multiple versions of python are installed in a system along with anaconda3, jupyter kernels might run the wrong python version. Here is an example.

jupyter1.png

When we start the Python 2 kernel explicitly from the drop-down menu, we expect Jupyter to be running Python 2. But that is not the case as verified below.

jupyter2.png

The culprit is the kernel.json file in the jupyter kernel folder at /usr/share/jupyter/kernels/python2.

jupyter3.png

The kernel.json file asks jupyter to run /usr/bin/python. But /usr/bin/python points to python3, not python2. Therefore, jupyter ends up running python 3. We need to replace python with python2 in the kernel.json file (/usr/share/jupyter/kernels/python2).

{
"argv": [
"/usr/bin/python2", 
"-m", 
"ipykernel_launcher", 
"-f", 
"{connection_file}"
], 
"display_name": "Python 2", 
"language": "python"
}

Save the file and restart jupyter.

I had to repeat a similar procedure with the Sagemath Jupyter kernel. /usr/share/jupyter/kernels/sagemath had the following kernel declaration for jupyter that was causing the kernel to crash.

{
"display_name": "SageMath 8.1", 
"argv": [
 "/usr/bin/sage",
 "--python",
 "-m",
 "sage.repl.ipython_kernel",
 "-f",
 "{connection_file}"
]
}

When --python was replaced with --python2, it started working.

{
"display_name": "SageMath 8.1", 
"argv": [
 "/usr/bin/sage",
 "--python2",
 "-m",
 "sage.repl.ipython_kernel",
 "-f",
 "{connection_file}"
]
}

For Sagemath installed from source this will never happen. However, if Sagemath was installed from the repositories (sudo apt install sagemath sagemath-common), then this error is inevitable due to sagemath’s dependency on python 2 (see my article https://bytesofcomputerwisdom.home.blog/2019/03/23/sagemathwont-run-no-module-named-sage-repl/).

How to search python interpreter history

To search the python interpreter history we can use the following code as-is, with search_string replaced by the string to be searched.

>>> import readline
>>> for i in range(readline.get_current_history_length()):
...     x = readline.get_history_item(i + 1)
...     if "search_string" in x:
...         print(x)

Note: The indentation of the code is crucial to its function.
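
The loop can also be wrapped into a reusable helper for a startup file (a sketch; search_history is a made-up name):

```python
import readline

def search_history(pattern):
    """Return all interpreter-history lines containing pattern."""
    matches = []
    # history items are 1-indexed
    for i in range(1, readline.get_current_history_length() + 1):
        line = readline.get_history_item(i)
        if pattern in line:
            matches.append(line)
    return matches
```

Note that get_history_item() counts from 1, which is why the loop above and the snippet before it both shift the index.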

Draw filled plots in python

Matplotlib’s fill() function can fill a polygon given the coordinates of its vertices. See the first example in the following link.

https://matplotlib.org/gallery/lines_bars_and_markers/fill.html

This method can be extended to draw filled curves or shade the area under a curve. For example, suppose we want to shade the area under a normal distribution up to the z-score -1.4 (Figure_3):

then we will first generate a list of (x, y) coordinate pairs along the curve of the distribution from x = -4 to x = -1.4 using y = exp(-x²/2)/√(2π).

The y-values are generated using the scipy function norm.pdf(), which returns the value of the Gaussian exp(-x²/2)/√(2π). This generates (x, y) pairs along the curve of the Gaussian. However, we also need to add the coordinate (-1.4, 0) so that the polygon closes down to the x-axis.

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

interval = 0.1

x = np.arange(-4, 4, interval)
y = norm.pdf(x)

# z-score up to which we shade
z = -1.4

# we want to shade from -inf to z, or rather the interval [x_min, z]
# generate x-coordinates in [x_min, z)
xarea = np.arange(-4, z, interval)
# generate y-coordinates
yarea = norm.pdf(xarea)
# close the polygon down to the x-axis by appending
# the points (z, pdf(z)) and (z, 0)
xarea = np.append(xarea, [z, z])
yarea = np.append(yarea, [norm.pdf(z), 0])

plt.plot(x, y)

plt.fill(xarea, yarea)

plt.text(-1.4 , 0 , '-1.4' , {'color':'r'})

plt.show()


Maximize plots from within code

# show plot in a maximized window
mng = plt.get_current_fig_manager()
mng.full_screen_toggle()

References
https://matplotlib.org/gallery/text_labels_and_annotations/usetex_demo.html#sphx-glr-gallery-text-labels-and-annotations-usetex-demo-py

Plotting columns of a dataframe in python

Every column of a dataframe can be plotted, provided it contains numeric values. To demonstrate this we will generate a dataframe from S&P data pulled from Yahoo finance using pandas_datareader. Then we plot one of the columns of the dataframe – the column labeled ‘Adj Close’ (adjusted close). We will use matplotlib’s pyplot to do the plotting, but seaborn can also be used.

from matplotlib import pyplot as plt
import datetime
# pandas_datareader imports stock data from yahoo finance 
from pandas_datareader import data as web

start = datetime.datetime(2018,1,5)
end = datetime.datetime(2018,2,5)
# get some data from yahoo finance
spdrdata = web.DataReader('SPY',"yahoo",start,end)
# plot the column of the dataframe called 'Adj Close'
spdrdata['Adj Close'].plot()
# display with pyplot
plt.show()


spdrdata['Adj Close'].plot() selects the column and directly plots it. The plot() function automatically uses the index of the dataframe as the x-axis. The index of this particular dataframe is the dates.

>>> spdrdata.index
DatetimeIndex(['2018-01-05', '2018-01-08', '2018-01-09', '2018-01-10',
               '2018-01-11', '2018-01-12', '2018-01-16', '2018-01-17',
               '2018-01-18', '2018-01-19', '2018-01-22', '2018-01-23',
               '2018-01-24', '2018-01-25', '2018-01-26', '2018-01-29',
               '2018-01-30', '2018-01-31', '2018-02-01', '2018-02-02',
               '2018-02-05', '2018-02-06'],
              dtype='datetime64[ns]', name='Date', freq=None)

Figure_1