Use pandas to convert a date to datetime format

Importing dates from a CSV file is always a hassle. With the myriad of DateTime formats possible, we would need to write extensive amounts of code to accommodate all possible formats, or put restrictions on the contents of the CSV file. We don't want to do either. Instead of hard-coding commands like

map(lambda s: datetime.strptime(s, "%m/%d/%Y"), strings)

into our code, we can use pandas to convert the dates for us. Pandas can convert an entire column of dates in string format to DateTime format. We just need to be careful when we are importing plain dates rather than full DateTime values: pandas converts date strings to Timestamp objects, which carry a time component. If we are importing just dates, the time component is undesirable, and we need to strip it off with .date() at the end. So instead of

pd.to_datetime(date)

we will need to use

pd.to_datetime(date).date()
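
For example, with a hypothetical date string:

>>> pd.to_datetime("10/28/2019")
Timestamp('2019-10-28 00:00:00')
>>> pd.to_datetime("10/28/2019").date()
datetime.date(2019, 10, 28)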

An example script illustrates this procedure.

import datetime as dt
import pandas as pd

def dateformat(self, date):
    # use pandas to convert a date string to datetime format
    # extract just the date, since pandas returns a Timestamp object
    # repack the date as datetime using datetime.datetime.combine() with time = 00:00
    date = dt.datetime.combine(pd.to_datetime(date).date(),
                               dt.datetime.min.time())
    return date
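
As an aside, pandas Timestamp objects also expose to_pydatetime(), which appears to do the same repacking in one call when the input has no time component:

>>> pd.to_datetime("10/28/2019").to_pydatetime()
datetime.datetime(2019, 10, 28, 0, 0)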


Timestamp error

TypeError: an integer is required (got type Timestamp)


This error is common when a date is extracted from a pandas dataframe datetime column or index. We need to extract the date from the Timestamp object by using the .date() method:

>>> startdate
Timestamp('2019-10-28 05:00:00')
>>> startdate.date()
datetime.date(2019, 10, 28)

If we need to convert back to a datetime format, then we combine the date and the time portions like this:

startdate = dt.datetime.combine(startdate.date(), startdate.time())


dt.datetime.combine() joins a date and a time into a single datetime object.
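
Continuing the example above:

>>> dt.datetime.combine(startdate.date(), startdate.time())
datetime.datetime(2019, 10, 28, 5, 0)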

Auto justify section headers with custom padding text

Let's say we have a text file that we want to organize with section headers. We want the headers to look nicely centered and to visually separate the sections. An example might look like this.

---========= DISPLAY CONTENTS OF DATABASES AND TABLES =========---

mysql> show tables;

---======================= MODIFY DATA ========================---

mysql> ALTER TABLE books MODIFY title VARCHAR(50) AFTER book_id;
mysql> UPDATE books SET genre=3 WHERE book_id=1;

Typing the extra = symbols and centering the text by hand is a laborious process. A short python script can center the text and generate the header.

def f(a):
    # center the header in a 60-character field of = and add --- at the ends
    print("---" + a.rjust(30 + len(a)//2, "=").ljust(60, "=") + "---")

f(" SORT ROWS ")
---======================== SORT ROWS =========================---

The function accepts the header string. It first right-justifies the string, padding (filling) the extra space in front with =. It then left-justifies the result, padding the extra space at the end with =.

a.rjust(30 + len(a)//2, "=")

does the main work. To center a string of length 11 in a 60-character header, the front padding should be (60 - 11)/2 ≈ 24 symbols. Setting the width to 30 + len(a)//2 = 35 makes rjust() pad the front with 35 - 11 = 24 symbols, which is exactly right. The left justify is simpler, since we know the whole header must end up 60 characters long:

ljust(60, "=")
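
Incidentally, Python's built-in str.center() can do the centering and the padding in one call. A minimal equivalent sketch (for this input it happens to produce the same 24/25 split of = symbols, though in general the odd extra fill character may land on the other side):

def f(a):
    # center the header text in a 60-character field, filling with =
    print("---" + a.center(60, "=") + "---")

f(" SORT ROWS ")
---======================== SORT ROWS =========================---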

Method overloading in python with no default parameters

Default values for method parameters are evaluated at declaration time, not at run time, so they cannot be set dynamically. Python also does not support C++-style overloading, where the same function name is declared with different signatures, such as this:

int func();
int func(int c);

However, the workaround is to give the parameters a default value of None, which allows the method to be called without any arguments. For example, the method

  def getdata(self, startdate, enddate):

when called as

s = obj.getdata()

will raise an exception:

TypeError: getdata() missing 2 required positional arguments: 'startdate' and 'enddate'

We need to modify the declaration to

  def getdata(self, startdate=None, enddate=None):
       if startdate is None:
           startdate = self.stockdata.index[0]
       if enddate is None:
           enddate = self.stockdata.index[-1]

That defines the method as taking two, one, or no arguments. When called without arguments, the if statements inject the defaults dynamically at run time.
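
A minimal sketch of the whole pattern (the class and attribute names are illustrative, not from a real library):

import pandas as pd

class StockHistory:
    def __init__(self, stockdata):
        # stockdata: a DataFrame indexed by date
        self.stockdata = stockdata

    def getdata(self, startdate=None, enddate=None):
        # the defaults are resolved here, at call time,
        # so they can depend on the instance
        if startdate is None:
            startdate = self.stockdata.index[0]
        if enddate is None:
            enddate = self.stockdata.index[-1]
        return self.stockdata.loc[startdate:enddate]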

Reference:
Using self.xxxx as a default parameter in a class method


Convert a column from str to float, dropping non-numeric strings

Let us say we have the following dataframe column:

df['Amount $']

0        0.07
1       1.154
2       2.596
3     X-Links
4    Amount $
5       0.102
Name: Amount $, dtype: object

And we want to convert the numbers to float and drop the non-numeric rows. str.isnumeric() will not work here: it returns False for strings that contain a decimal point, so every value in this column would be rejected. One option is to write a small function which tries to convert a string to a float and returns False if the conversion fails. Mapping this function over the entire column returns a boolean Series where True means the value parses as a float. Using that Series as a boolean mask on the dataframe filters out the non-numeric rows.

def tryfloat(f):
    # True if the string parses as a float, False otherwise
    try:
        float(f)
        return True
    except ValueError:
        return False

df[ df['Amount $'].apply(tryfloat) ]


The result is this table:

0     0.07
1    1.154
2    2.596
5    0.102
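
Note that the mask only filters; the values are still strings, so a final .astype(float) is needed for the actual conversion. Alternatively, pandas can do both steps without a helper function: pd.to_numeric with errors='coerce' turns anything non-numeric into NaN, which can then be dropped.

# non-numeric strings become NaN and are then dropped
amounts = pd.to_numeric(df['Amount $'], errors='coerce').dropna()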

Dataframe manipulation with pandas

Merge dataframes

db1 = pd.DataFrame({'Name':['Jones','Will','Rory','Bryce','Hok'],
                    'job_id':[2,5,3,7,2]}, index=[1,2,3,4,5])

db2 = pd.DataFrame({'Name':['CEO','Chairman','Vice-Chairman','Senior Engineer'],
                    'job_id':[5,1,2,3]}, index=[1,2,3,4])

df = pd.merge(db1, db2, on='job_id')

  Name_x  job_id           Name_y
0  Jones       2    Vice-Chairman
1    Hok       2    Vice-Chairman
2   Will       5              CEO
3   Rory       3  Senior Engineer

merge() defaults to an inner join: rows whose job_id has no match in the other dataframe (Bryce and Chairman here) are dropped, and the overlapping Name columns are renamed Name_x and Name_y automatically.

https://pandas.pydata.org/pandas-docs/stable/merging.html

Extracting rows from a dataframe by row number using iloc

>>> df.iloc[2]
Name_x    Will
job_id       5
Name_y     CEO
Name: 2, dtype: object
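
iloc also accepts slices, so a range of rows can be extracted the same way:

>>> df.iloc[0:2]
  Name_x  job_id         Name_y
0  Jones       2  Vice-Chairman
1    Hok       2  Vice-Chairman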

Extracting rows which match a string value

Syntax: df[ (df['col'] == "some value") ]

(hpi['RegionName'] == "Mesa") generates a boolean Series which can then be used to extract the rows that are True from hpi. The parentheses become essential when several conditions are combined with & or |, as in the second example below.

# select all rows where the RegionName is "Mesa"
mesadataall = hpi[ (hpi['RegionName'] == "Mesa")  ]
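
For example, selecting on two conditions at once (the Year column here is hypothetical):

# parentheses around each comparison are required when combining with & or |
mesa2019 = hpi[ (hpi['RegionName'] == "Mesa") & (hpi['Year'] == 2019) ]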

Cleaning databases using replace()

# clean data with sed-like regex substitutions
# remove parenthesized references such as " (2014)"
moddata.replace(r" \(.*\)", "", inplace=True, regex=True)
# replace the word unavailable with 0
moddata.replace(r"unavailable", "0", inplace=True, regex=True)


These regex substitutions clean the data: the first strips parenthesized annotations, and the second replaces the placeholder word unavailable with 0, so that only numeric text remains.
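
As a quick sanity check, here is the first pattern applied to a hypothetical cell value:

>>> import re
>>> re.sub(r" \(.*\)", "", "1,234 (2014)")
'1,234'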

Web scrape tables from website using pandas

data = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate')
# every table in the HTML page is imported into a list of dataframes
# the table we want is at index 4 of the list

df = data[4]
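
The position of the desired table can shift whenever the page layout changes, so it may be worth inspecting the list before committing to an index:

# check how many tables were parsed and preview the candidate
print(len(data))
print(data[4].head())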


to be continued …

Jupyter notebook running the wrong python version

When multiple versions of python are installed on a system along with anaconda3, jupyter kernels might run the wrong python version. Here is an example.

[Figure: jupyter1.png]

When we start the Python 2 kernel explicitly from the drop-down menu, we expect Jupyter to be running Python 2. But that is not the case as verified below.

[Figure: jupyter2.png]

The culprit is the kernel.json file in the jupyter kernel folder at /usr/share/jupyter/kernels/python2.

[Figure: jupyter3.png]

The kernel.json file asks jupyter to run /usr/bin/python. But /usr/bin/python points to python3, not python2, so jupyter ends up running Python 3. We need to replace python with python2 in the kernel.json file (/usr/share/jupyter/kernels/python2):

{
  "argv": [
    "/usr/bin/python2",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python 2",
  "language": "python"
}

Save the file and restart jupyter.
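
If the location of a kernel folder is not obvious, jupyter itself can list every installed kernelspec along with its directory:

jupyter kernelspec list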

I had to repeat a similar procedure with the SageMath Jupyter kernel. /usr/share/jupyter/kernels/sagemath had the following kernel declaration, which was causing the kernel to crash.

{
  "display_name": "SageMath 8.1",
  "argv": [
    "/usr/bin/sage",
    "--python",
    "-m",
    "sage.repl.ipython_kernel",
    "-f",
    "{connection_file}"
  ]
}

When --python was replaced with --python2, it started working:

{
  "display_name": "SageMath 8.1",
  "argv": [
    "/usr/bin/sage",
    "--python2",
    "-m",
    "sage.repl.ipython_kernel",
    "-f",
    "{connection_file}"
  ]
}

For SageMath installed from source this will never happen. However, if SageMath was installed from the repositories (sudo apt install sagemath sagemath-common), then this error is inevitable due to sagemath's dependency on Python 2 (see my article https://bytesofcomputerwisdom.home.blog/2019/03/23/sagemathwont-run-no-module-named-sage-repl/).

How to search python interpreter history

To search the python interpreter history, we can use the following code as it is, with search_string replaced by the string to be searched.

>>> import readline
>>> for i in range(readline.get_current_history_length()):
...     x = readline.get_history_item(i+1)
...     if "search_string" in x:
...         print(x)

Note: The indentation of the code is crucial to its function.

Draw filled plots in python

Matplotlib's fill() function can fill a polygon given the coordinates of its vertices. See the first example at the following link.

https://matplotlib.org/gallery/lines_bars_and_markers/fill.html

This method can be extended to draw filled curves or to shade the area under a curve. For example, suppose we want to shade the area under a normal distribution up to the z-score -1.4, like this:

[Figure_3.png]

then we first generate a list of (x, y) coordinate pairs along the curve of the distribution from x = -4 to x = -1.4 using y = exp(-x²/2)/√(2π).

The y-values are generated with the scipy function norm.pdf(), which returns the value of the standard normal density exp(-x²/2)/√(2π). This gives the (x, y) pairs along the curve of the Gaussian. However, we also need to add the coordinate (-1.4, 0) so that the curve closes down to the x-axis.

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

interval = 0.1

x = np.arange(-4, 4, interval)
y = norm.pdf(x)

# z-score
z = -1.4

# we want to shade from -inf to z, or rather the interval [x_min, z]
# generate x-coordinates in [x_min, z]
xarea = np.arange(-4, z, interval)
# generate y-coordinates
yarea = norm.pdf(xarea)
# we need to add the point (x, y) = (z, 0) so that the
# polygon fills all the way down to the x-axis
# append x = z
xarea = np.append(xarea, z)
# append y = 0
yarea = np.append(yarea, 0)

plt.plot(x, y)

plt.fill(xarea, yarea)

plt.text(-1.4 , 0 , '-1.4' , {'color':'r'})

plt.show()
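
For reference, matplotlib's fill_between() can shade between a curve and the x-axis directly, without manually closing the polygon. A minimal sketch using the same variables:

# shade between the curve and y = 0 over [x_min, z]
xa = np.arange(-4, z, interval)
plt.fill_between(xa, norm.pdf(xa), 0)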


Maximize plots from within code

# toggle the plot window to full screen from within the code
mng = plt.get_current_fig_manager()
mng.full_screen_toggle()
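
full_screen_toggle() gives a borderless full-screen window; getting an ordinary maximized window is backend-specific. For example, with the Qt backend (an assumption about the environment, other backends expose different window objects):

# Qt backend only: maximize the window instead of going full screen
mng.window.showMaximized()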

References
https://matplotlib.org/gallery/text_labels_and_annotations/usetex_demo.html#sphx-glr-gallery-text-labels-and-annotations-usetex-demo-py

Plotting columns of a dataframe in python

Every column of a dataframe can be plotted, provided it contains numeric values. To demonstrate this, we will generate a dataframe from S&P 500 data pulled from Yahoo Finance using pandas_datareader. Then we plot one of the columns of the dataframe, the column labeled 'Adj Close' (adjusted close). We will use matplotlib's pyplot to do the plotting, but seaborn can also be used (a sketch appears at the end of this section).

from matplotlib import pyplot as plt
import datetime
# pandas_datareader imports stock data from yahoo finance 
from pandas_datareader import data as web

start = datetime.datetime(2018,1,5)
end = datetime.datetime(2018,2,5)
# get some data from yahoo finance
spdrdata = web.DataReader('SPY',"yahoo",start,end)
# plot the column of the dataframe called 'Adj Close'
spdrdata['Adj Close'].plot()
# display with pyplot
plt.show()


spdrdata['Adj Close'].plot() selects the column and directly plots it. The plot() function automatically uses the index of the dataframe as the x-axis. The index of this particular dataframe consists of the dates:

>>> spdrdata.index
DatetimeIndex(['2018-01-05', '2018-01-08', '2018-01-09', '2018-01-10',
               '2018-01-11', '2018-01-12', '2018-01-16', '2018-01-17',
               '2018-01-18', '2018-01-19', '2018-01-22', '2018-01-23',
               '2018-01-24', '2018-01-25', '2018-01-26', '2018-01-29',
               '2018-01-30', '2018-01-31', '2018-02-01', '2018-02-02',
               '2018-02-05', '2018-02-06'],
              dtype='datetime64[ns]', name='Date', freq=None)

[Figure_1: plot of the 'Adj Close' column against the date index]
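
Since seaborn was mentioned, here is a rough equivalent sketch (assuming a recent seaborn is installed):

import seaborn as sns
# plot the same column with seaborn; the dates go on the x-axis
sns.lineplot(data=spdrdata['Adj Close'])
plt.show()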