Merging DataFrames
db1 = pd.DataFrame({'Name': ['Jones', 'Will', 'Rory', 'Bryce', 'Hok'],
                    'job_id': [2, 5, 3, 7, 2]}, index=[1, 2, 3, 4, 5])
db2 = pd.DataFrame({'Name': ['CEO', 'Chairman', 'Vice-Chairman',
                             'Senior Engineer'],
                    'job_id': [5, 1, 2, 3]}, index=[1, 2, 3, 4])
df = pd.merge(db1, db2, on='job_id')
|   | Name_x | job_id | Name_y          |
|---|--------|--------|-----------------|
| 0 | Jones  | 2      | Vice-Chairman   |
| 1 | Hok    | 2      | Vice-Chairman   |
| 2 | Will   | 5      | CEO             |
| 3 | Rory   | 3      | Senior Engineer |
By default merge() performs an inner join: rows whose job_id has no match in the other frame (here Bryce, with job_id 7, and Chairman, with job_id 1) are dropped, and the overlapping Name columns are disambiguated with the _x and _y suffixes.
https://pandas.pydata.org/pandas-docs/stable/merging.html
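The inner-join behaviour can be changed with the `how` parameter. A minimal sketch (data made up for illustration) contrasting the default inner join with an outer join, which keeps unmatched rows and pads them with NaN:

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Jones', 'Will'], 'job_id': [2, 9]})
right = pd.DataFrame({'Title': ['Vice-Chairman', 'CEO'], 'job_id': [2, 5]})

inner = pd.merge(left, right, on='job_id')               # keeps only job_id 2
outer = pd.merge(left, right, on='job_id', how='outer')  # keeps 2, 9 and 5
```

`how='left'` and `how='right'` similarly keep all keys from one side only.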
Extracting rows from a dataframe by row number using iloc
>>> df.iloc[2]
Name_x    Will
job_id       5
Name_y     CEO
Name: 2, dtype: object
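iloc selects by integer position, whereas loc selects by index label; the two differ whenever the index does not start at 0. A quick sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jones', 'Will', 'Rory']}, index=[10, 20, 30])

row = df.iloc[1]   # second row by position
same = df.loc[20]  # the same row, selected by its index label
```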
Extracting rows which match a string value
Syntax: df[ ( df['col'] == "some value" ) ]
(hpi['RegionName'] == "Mesa") generates a Boolean Series, which is then used as a mask to extract from hpi[] the rows where it is True. The parentheses around each comparison become essential when conditions are combined, because the combining operators & and | bind more tightly than ==.
# select all rows where the RegionName is "Mesa"
mesadataall = hpi[ (hpi['RegionName'] == "Mesa") ]
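Combining two conditions shows why the parentheses matter: without them, & would be evaluated before ==, raising an error. A sketch using a hypothetical stand-in for the hpi frame (column names invented for illustration):

```python
import pandas as pd

# hypothetical stand-in for the hpi DataFrame
hpi = pd.DataFrame({'RegionName': ['Mesa', 'Mesa', 'Tempe'],
                    'Year': [2013, 2014, 2014]})

# each comparison must be parenthesised before combining with &
mesa_2014 = hpi[(hpi['RegionName'] == 'Mesa') & (hpi['Year'] == 2014)]
```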
Cleaning DataFrames using replace()
# clean data with sed-like regex
# remove all "(2014)"-style references
moddata.replace(r" \(.*\)", "", inplace=True, regex=True)
# replace the word unavailable by 0
moddata.replace(r"unavailable", "0", inplace=True, regex=True)
These regexes clean the data: the first strips parenthesised annotations such as "(2014)", and the second replaces the placeholder "unavailable" with "0" so the column can later be converted to a numeric dtype.
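After the textual cleanup the column is still of object (string) dtype; pd.to_numeric finishes the job. A sketch with a made-up column name and raw values mimicking the annotations the regexes target:

```python
import pandas as pd

# hypothetical raw column with the kind of annotations the regexes remove
moddata = pd.DataFrame({'deaths': ['12 (2014)', 'unavailable', '7 (2013)']})

moddata.replace(r" \(.*\)", "", inplace=True, regex=True)       # strip "(year)" notes
moddata.replace(r"unavailable", "0", inplace=True, regex=True)  # placeholder -> "0"
moddata['deaths'] = pd.to_numeric(moddata['deaths'])            # strings -> numbers
```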
Web scrape tables from website using pandas
data = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate')
# every HTML table on the page is imported as a list of DataFrames
# the table of interest is at index 4 (the fifth element of the list)
df = data[4]
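read_html also accepts raw HTML, which makes it easy to experiment offline. A minimal sketch with a tiny invented table standing in for a scraped page (real pages usually contain several tables, hence the list):

```python
import pandas as pd
from io import StringIO

# a tiny stand-in for a scraped page
html = """
<table>
  <tr><th>Country</th><th>Rate</th></tr>
  <tr><td>A</td><td>1.2</td></tr>
  <tr><td>B</td><td>3.4</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # list of DataFrames, one per <table>
df = tables[0]
```

The <th> cells are picked up as the column headers automatically.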
to be continued …