Dataframe manipulation with pandas

Merge databases

db1 = pd.DataFrame({'Name':['Jones','Will','Rory','Bryce','Hok'],
 'job_id':[2,5,3,7,2]}, index=[1,2,3,4,5])



db2 = pd.DataFrame({'Name':['CEO','Chairman','Vice-Chairman',
'Senior Engineer'], 'job_id':[5,1,2,3]}, index=[1,2,3,4])

df = pd.merge(db1,db2,on='job_id')
Name_x  job_id        Name_y
0  Jones       2 Vice-Chairman
1    Hok     2 Vice-Chairman
2   Will       5    CEO
3   Rory       3 Senior Engineer

merge() automatically removes the rows which contain null placeholder values similar to inner join and renames the columns appropriately.

https://pandas.pydata.org/pandas-docs/stable/merging.html

Extracting rows from a dataframe by row number using iloc

>>> df.iloc[2]
Name_x    Will
job_id       5
Name_y     CEO
Name: 2, dtype: object

Extracting rows which match a string value

Syntax: df[ ( df[‘col’] == “some value” )  ]

(hpi[‘RegionName’] == “Mesa”) generates a Boolean set which can then be used to extract the rows which are True from hpi[]. Note that the ( ) are crucial to the operation of converting it to a set.

# select all rows where the RegionName is "Mesa"
mesadataall = hpi[ (hpi['RegionName'] == "Mesa")  ]

Cleaning databases using replace()

# clean data with sed like REGEX
# remove all (2014) references
moddata.replace(r" \(.*\)", "", inplace=True, regex=True) 
# replace the word unavailable by 0 
moddata.replace(r"unavailable", "0", inplace=True, regex=True) 

 

These REGEX clean the data by removing non-numeric data and replacing them by 0.

Web scrape tables from website using pandas

data = pd.read_html(
'https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate')
# entire HTML is imported as a list
# the table in is the fourth element of the list

df = data[4]

 

to be continued …

Leave a comment