Last active
September 9, 2019 05:05
-
-
Save vikas-git/67816c40e2299af2faa8f94361e3b53b to your computer and use it in GitHub Desktop.
In this gist trying to explain how to use sql queries on dataframes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Source Blog: https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e?source=search_post---------0 | |
-> Sql queries vs pandas queries | |
* select Query | |
-> select * from airports; | |
-> airports.head() | |
* select query with limit | |
-> select * from airports limit 3 | |
-> airports.head(3) | |
* Get number of rows and columns | |
-> airports.shape // return (55536, 18) | |
* select certain column with where clause | |
-> select id from airports where ident='KLAX' | |
-> airports[airports.ident=='KLAX'].id | |
* select with multiple conditions | |
-> select * from airports where iso_region = 'US-CA' and type = 'seaplane_base' | |
-> airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')] | |
-> select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport' | |
-> airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']] | |
* ORDER BY | |
-> select * from airport_freq where airport_ident = 'KLAX' order by type | |
-> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type') | |
-> select * from airport_freq where airport_ident = 'KLAX' order by type desc | |
-> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False) | |
* IN...NOT IN | |
-> select * from airports where type in ('heliport', 'balloonport') | |
-> airports[airports.type.isin(['heliport', 'balloonport'])] | |
-> select * from airports where type not in ('heliport', 'balloonport') | |
-> airports[~airports.type.isin(['heliport', 'balloonport'])] | |
* Groupby, count, ORDER BY | |
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type | |
-> airports.groupby(['iso_country', 'type']).size() | |
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc | |
-> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False]) | |
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type | |
-> airports.groupby(['iso_country', 'type']).size() | |
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc | |
-> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False]) | |
* HAVING | |
-> select type, count(*) from airports where iso_country = 'US' group by type having count(*) > 1000 order by count(*) desc | |
-> airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False) | |
* TOP N Records | |
-> airports.head() // by default it returns 5 | |
-> airports.head(3) | |
-> select iso_country from by_country order by size desc limit 10 | |
-> by_country.nlargest(10, columns='airport_count') | |
-> select iso_country from by_country order by size desc limit 10 offset 10 | |
-> by_country.nlargest(20, columns='airport_count').tail(10) | |
* Aggregate function (MIN, MAX, MEAN, MEDIAN) | |
-> select max(length_ft), min(length_ft), mean(length_ft), median(length_ft) from runways | |
-> runways.agg({'length_ft': ['min', 'max', 'mean', 'median']}) | |
OR | |
-> df.T | |
* Join | |
Short note : Use .merge() to join Pandas dataframes. You need to provide which columns to join | |
on (left_on and right_on), and join type: inner (default), left (corresponds to LEFT OUTER in SQL), | |
right (RIGHT OUTER), or outer (FULL OUTER). | |
->select airport_ident, type, description, frequency_mhz from airport_freq join airports on | |
airport_freq.airport_ref = airports.id where airports.ident = 'KLAX' | |
-> airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', | |
right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']] | |
* Union All and Union | |
-> select name, municipality from airports where ident = 'KLAX' | |
union all | |
select name, municipality from airports where ident = 'KLGB' | |
-> pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], | |
airports[airports.ident == 'KLGB'][['name', 'municipality']]]) | |
^Note : (To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates().) | |
* Insert query | |
-> create table heroes (id integer, name text); | |
-> insert into heroes values (1, 'Harry Potter'); | |
-> insert into heroes values (2, 'Ron Weasley'); | |
-> insert into heroes values (3, 'Hermione Granger'); | |
-> df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']}) | |
-> df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']}) | |
-> pd.concat([df1, df2]).reset_index(drop=True) | |
* Update | |
-> update airports set home_link = 'http://www.lawa.org/welcomelax.aspx' where ident == 'KLAX' | |
-> airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx' | |
* Delete | |
-> delete from lax_freq where type = 'MISC' | |
-> lax_freq = lax_freq[lax_freq.type != 'MISC'] | |
-> lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index) | |
* or added a new calculated column: | |
-> df['total_cost'] = df['price'] * df['quantity'] | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment