How to Rewrite SQL Queries in Pandas
If you’re reading this, you’re probably a data scientist already familiar with SQL and Pandas. But if you’re like me, you scratch your head each time you wish to write a SQL query against your Pandas DataFrame.
This article attempts to provide a mapping between common operations in SQL and their counterpart in Pandas using a fictional housing dataset.
Select, order by, and limit
Say you want to select all columns in your dataset and limit the number of rows coming back. In SQL,
select * from housing limit 10
Want to order by number of bedrooms? In SQL,
select * from housing order by num_bedrooms limit 10
Just want to select a few columns? In SQL,
select num_bedrooms, price from housing
Filtering with where
Only care about houses with three or more bedrooms? In SQL,
select * from housing where num_bedrooms >= 3
housing_df[housing_df.num_bedrooms >= 3]
Want multiple filters? In SQL,
select * from housing where num_bedrooms >= 3 and price < 450000
housing_df[(housing_df.num_bedrooms >= 3) & (housing_df.price < 450000)]
Or what if you want to filter rows by values in a column?
select * from housing where building_material in ('brick', 'wood')
Want rows not in the list?
I’ve found myself many times in the situation where I want to join two tables together on a particular column so that I have as many features available for later analysis. In SQL, this has been relativey easy:
select * from housing left outer join more_housing on housing.house_id = more_housing.house_id
pd.merge(housing_df, more_housing_df, on='house_id', how='left')
Joining on multiple columns between both tables? In SQL,
select * from housing left outer join more_housing on housing.house_id = more_housing.house_id and housing.price = more_housing.price
pd.merge(housing_df, more_housing_df, on=['house_id', 'price'], how='left')
Want to stack or union two tables? Let’s get a list of all housing prices from both tables.
select price from housing union select price from more_housing
Group by and having
Say we want to count how many houses there are when grouped by
select num_bedrooms, is_two_story, count(*) from housing group by num_bedrooms, is_two_story
Want to sort those values now?
housing.groupby(['num_bedrooms', 'is_two_story']).size().to_frame('size').reset_index().sort_values(['num_bedrooms', 'size'])
Or what if you just want to consider groups where the count is greater than 5?
housing.groupby(['num_bedrooms', 'is_two_story']).filter(lambda x: len(x) > 5).groupby(['num_bedrooms', 'is_two_story']).size().to_frame('size').reset_index().sort_values(['num_bedrooms', 'size'])
A note about multiple indices
Say you’re doing some quick analysis on COVID-19 data and want to get the total number of confirmed cases over time for the US. You’d start with:
us_df = df[df['Country'] == 'US'].groupby(['Country', 'Last Updated']).sum()[['Confirmed']]
The problem is that now the dataframe as two indices and this can make it hard to plot the data. We can unstack the first column in the groupby,
Country, to leave us with the
Last Updated column as an index:
us_df = us_df.unstack(level=0)
Now we just rename the only column we have, add the index as a new column, then just replace the index with the row count.
us_df.columns = ['Confirmed'] us_df['Last Updated'] = us_df.index us_df.index = np.arange(us_df.shape)