How to Rewrite SQL Queries in Pandas

April 1, 2020

Kung Fu Panda Cover Image — Kung Fu Panda

If you’re reading this, you’re probably a data scientist already familiar with SQL and Pandas. But if you’re like me, you scratch your head each time you wish to write a SQL query against your Pandas DataFrame.

This article attempts to provide a mapping between common operations in SQL and their counterpart in Pandas using a fictional housing dataset.

Select, order by, and limit

Say you want to select all columns in your dataset and limit the number of rows coming back. In SQL,

select *
from housing
limit 10

In Pandas,

housing_df.head(10)

Want to order by number of bedrooms? In SQL,

select *
from housing
order by num_bedrooms
limit 10

In Pandas,

housing_df.sort_values(by=['num_bedrooms']).head(10)

Just want to select a few columns? In SQL,

select num_bedrooms, price
from housing

In Pandas,

housing_df[['num_bedrooms', 'price']]

Filtering with where

Only care about houses with three or more bedrooms? In SQL,

select *
from housing
where num_bedrooms >= 3

In Pandas,

housing_df[housing_df.num_bedrooms >= 3]

Want multiple filters? In SQL,

select *
from housing
where num_bedrooms >= 3 and price < 450000

In Pandas,

housing_df[(housing_df.num_bedrooms >= 3) & (housing_df.price < 450000)]

Or what if you want to filter rows by values in a column?

select *
from housing
where building_material in ('brick', 'wood')

In Pandas,

housing_df[housing_df.building_material.isin(['brick', 'wood'])]

Want rows not in the list?

housing_df[~housing_df.building_material.isin(['brick', 'wood'])]

Joins

I’ve found myself many times in the situation where I want to join two tables together on a particular column so that I have as many features available for later analysis. In SQL, this has been relativey easy:

select *
from housing
left outer join more_housing
on housing.house_id = more_housing.house_id

In Pandas,

pd.merge(housing_df, more_housing_df, on='house_id', how='left')

Joining on multiple columns between both tables? In SQL,

select *
from housing
left outer join more_housing
on housing.house_id = more_housing.house_id
   and housing.price = more_housing.price

In Pandas,

pd.merge(housing_df, more_housing_df, on=['house_id', 'price'], how='left')

Want to stack or union two tables? Let’s get a list of all housing prices from both tables.

select price
from housing
union
select price
from more_housing

In Pandas,

pd.concat([housing_df, more_housing_df]).price

Group by and having

Say we want to count how many houses there are when grouped by num_bedrooms and is_two_story.

select num_bedrooms,
       is_two_story,
       count(*)
from housing
group by num_bedrooms, is_two_story

In Pandas,

housing.groupby(['num_bedrooms', 'is_two_story']).size()

Want to sort those values now?

housing.groupby(['num_bedrooms', 'is_two_story']).size().to_frame('size').reset_index().sort_values(['num_bedrooms', 'size'])

Or what if you just want to consider groups where the count is greater than 5?

housing.groupby(['num_bedrooms', 'is_two_story']).filter(lambda x: len(x) > 5).groupby(['num_bedrooms', 'is_two_story']).size().to_frame('size').reset_index().sort_values(['num_bedrooms', 'size'])

A note about multiple indices

Say you’re doing some quick analysis on COVID-19 data and want to get the total number of confirmed cases over time for the US. You’d start with:

us_df = df[df['Country'] == 'US'].groupby(['Country', 'Last Updated']).sum()[['Confirmed']]

The problem is that now the dataframe as two indices and this can make it hard to plot the data. We can unstack the first column in the groupby, Country, to leave us with the Last Updated column as an index:

us_df = us_df.unstack(level=0)

Now we just rename the only column we have, add the index as a new column, then just replace the index with the row count.

us_df.columns = ['Confirmed']
us_df['Last Updated'] = us_df.index
us_df.index = np.arange(us_df.shape[0])

Hopefully this helps!

For more details about implementation, make sure to check out Pandas’ documentation. There are also some other good posts about SQL to Pandas usage like this one.

How to Rewrite SQL Queries in Pandas

Select, order by, and limit

Filtering with where

Joins

Group by and having

A note about multiple indices

Hopefully this helps!

See Also 👀