Data Science
Start filtering your data easily using pandas query function
Master dataset filtering using pandas query function!
Data analysis in Python is made easy with Pandas library. While doing data analysis task, often you need to select a subset of data to dive deep. And this can be easily achieved using pandas.DataFrame.query() function.
Therefore, it is important to understand how efficiently and effectively you can leverage it.
I have listed 10 examples explaining almost all the use-cases when you can use the query function to filter data points. At the end, you will be able to use pandas query function as and when needed without any struggle.
You can quickly jump to your favorite use-case with below index.
· Filtering using Single Condition
∘ Example 1
· Filtering on Multiple Conditions
∘ Example 1
∘ Example 2
∘ Example 3
· Filtering based on Text Columns
∘ Example 1
· Simple Math Operation in Pandas Query
∘ Example 1
∘ Example 2
· Built-In Functions in Pandas Query
∘ Example 1
∘ Example 2
· Filtering based on Date-Time Columns
∘ Example 1
∘ Example 2
· Inplace in Pandas Query Function
📍 Note: I’m using a self created Dummy_Sales_Data which you can get on my Github repo for Free under MIT License!!
Let’s import the dataset into pandas DataFrame — df
import pandas as pd
df = pd.read_csv("Dummy_Sales_Data_v1.csv")
df.head()
It is a simple 9999 x 12 Dataset which I created using Faker in Python 📚
You can download a complete notebook with all these examples at the end of this read!
Before going into the examples, a quick recap of query function in pandas —
Query function is used to extract the records based on the expression specified and returns a new DataFrame. The expression is the condition or combination of conditions written in the form of string.
In the backend, this expression is evaluated using eval()
function in pandas and returns the subset of the data or the records where the expression is evaluated as TRUE.
Therefore, to filter pandas DataFrame all you need to do is specify the condition within query function as you will see in subsequent examples.
Why you should use
query()
over pandas DataFrame.loc
,.iloc
and bracket[ ]
notation
- The pandas DataFrame properties —
.loc
and.iloc
— are specifically used to extract the subset of dataset based on row and column labels and indices. So it does not really offer you the flexibility to filter DataFrame based on a condition. - The bracket notation
[ ]
gives you the flexibility to filter DataFrame based on condition but it is syntactically bulky to write with multiple pairs of square brackets
On the other hand, pandas query()
function gives you the flexibility to extract subset of the DataFrame based on one or more conditions which are written as an absolutely easy-going expression. You really don’t need to think about or check for any missed brackets! 💯
Now, as you are aware about why you should prefer query() function to filter the dataset, let’s get started with the examples.
Beginning with the simplest use-case — filtering the DataFrame based on a single condition i.e. condition on one column only.
When filtering on single condition, the expression to be evaluated in the query()
function will contain only one condition. And output returned will contain all the rows where this expression evaluates to be TRUE
.
Example 1
Suppose you want to extract all the rows where Quantity is 95. So the condition in the logical form can be written as —
Quantity == 95
📌 Remember, you need to write this condition as a string i.e. enclose it within double quotes “ ”
.
So you can filter the DataFrame based on this condition as below —
df.query("Quantity == 95")
As simple as it looks. It returned all the rows where the Quantity is 95.
Well, this was quite simple and even you can even use bracket notation like this — df[df[“Quantity”]==95]
.
But what if you want to include one more condition on the same column??
It certainly adds one more pairs of square brackets in the bracket notation making it bulky and unmanageable in the long run. And that’s when effectiveness of query()
comes into the picture.
Whether you filter on one or multiple conditions, the syntax of query() remains same — write the conditions as string by enclosing them in “ ”
.
However, you must specify how you want to do filtering based on two or more conditions and accordingly you can choose from two logics between the conditions as below,
AND
: It returns all the records from the DataFrame only where both the conditions are satisfiedOR
: It returns all the records from the DataFrame where either or both the conditions are satisfied.
Let’s see how it works using below two examples.
Example 1
Suppose, you want to select all the rows where both conditions — Quantity is 95 & unit price is 182 — are TRUE.
So, you should go for AND
logic. In query() expression, it is implemented using the keyword and
. 💯
Note that, the column containing unit price is named as UnitPrice(USD)
So, conditions are —
Quantity == 95
UnitPrice(USD) == 182
And the expression will become —
"Quantity == 95 and UnitPrice(USD) == 182"
To extract the required dataset, you write —
df.query("Quantity == 95 and UnitPrice(USD) == 182")
However, instead of output, you get a KeyError on ‘UnitPrice’ as below. 🚨🚨
But why did you get this error??
It is because, query() function has some limitations on the column names. And the column name UnitPrice(USD) is invalid to use in query().
The query() expression, interprets UnitPrice from the column name UnitPrice(USD) as a function to be operated on variable USD.
📌 The query() guidelines provide a quick alternative on this problem as mentioning invalid column names in back ticks as — `UnitPrice(USD)`
So the correct expression you should write is —
df.query("Quantity == 95 and `UnitPrice(USD)` == 182")
As you can see, there are only 3 records when these both conditions are satisfied.
However, again there is a chance that you forgot to mention one of the back tick and program will throw an another Syntax Error. Hence, one of the simplest solution is to change the column name as below —
df.rename(columns={'UnitPrice(USD)':'UnitPrice',
'Shipping_Cost(USD)':'Shipping_Cost',
'Delivery_Time(Days)':'Delivery_Time'},
inplace=True)
So, you can now get the same output using the new column name as —
df.query("Quantity == 95 and UnitPrice == 182")
Alternatively, you can also get the same output using Ampersand operator &
as —
df.query("Quantity == 95 & UnitPrice == 182")
You can see how simple it is — you can write expression as if you are writing it on paper in simple English.
Now, let’s see how you can implement OR
logic.
Example 2
Suppose you want to get all the rows where at least one of the conditions mentioned above is satisfied.
All you need to do is use the keyword or
between two conditions as below —
df.query("Quantity == 95 or UnitPrice == 182")
It returned all the rows where either of the two condition True (see rows 2 to 5 in above picture) and also the rows where both conditions are True (row 1)
Here also, you can use bitwise operator |
instead of the or
keyword. ✅
Going a step ahead, you can also use NOT
logic in query() which returns all the records when the specified condition within query is evaluated to FALSE
.
Example 3
Suppose you want to get all the rows where Quantity is not equal to 95.
The simplest answer is to use not
keyword or negation operator ~
in the expression before the condition as shown below.
df.query("not (Quantity == 95)")
As you can see in the output, it contains all the rows where Quantity is not 95.
Moreover, condition need not be always an equality operator, rather you can choose anything from ==
, !=
, >
, <
, ≥
, ≤ while defining the condition.
So, you can get same output as NOT
logic, using non-equality operator !=
in the condition as below
df.query("Quantity != 95")
This will further save your efforts in writing (and may be forgetting to close) additional round brackets! And thus simplifying query()
expression.
Well, the conditions are not always need to be on numerical columns. You can always filter the data points based on non-numerical, text columns.
While filtering based on text columns, your condition should be comparing column name with a string.
Remember, your
query()
expression is already a String. Then how to write a String within another String??
📌 Simply enclose the Text value which you want to use in condition, in single quotes as ‘ ’
. Let’s see an example on how to do this.
Example 1
Suppose, you want to get all the records where Status is ‘Not Shipped’. You can write this in query()
expression as —
df.query("Status == 'Not Shipped'")
It returns all the records, where Status column contains value — ‘Not Shipped’.
Again you can use multiple conditions on same column or different column and it can be combination of conditions on numerical as well as non numerical columns. 💯
In real world, most of the times, the conditions which you use for filtering the DataFrame include certain calculations. Pandas query() gives you freedom to use mathematics within query expression.
The math operation can be anything such as addition, subtraction, multiplication, division or even square or cube of a value in the column.
As mathematical operations are meant to be used on numerical values, you can use them on the numerical columns only in the query expression as seen in examples below.
Example 1
For an instance, suppose you want to fetch all the rows from the dataset when double of shipping cost is less than 50.
This is as simple as writing the expression in plain English, as below
df.query("Shipping_Cost*2 < 50")
Boom! It returned all the required rows.
You can also include even a bit complex calculations on one or multiple columns. 💯
Example 2
Suppose, you wish to get all the rows where summation of square of quantity and square of shipping cost is less than 500.
df.query("Quantity**2 + Shipping_Cost**2 < 500")
You might have noticed, how simple it is to use even the complex calculations using query() functions.
However, you are not limited to use only different mathematical operations, rather you can use built-in functions within the query expressions.
The Python built-in functions such as sqrt()
, abs()
, factorial()
, exp()
and many others can be easily used within query expression. You can use it directly as you might have used it in normal use-cases.
Example 1
Retrieving all the rows where square root of unit price is more than 15. So the built-in function — sqrt()
— will be used here.
df.query("sqrt(UnitPrice) > 15")
Easy it is! All the rows with unit price more than 225 are returned through this query.
Further, query() function is so flexible that, you can also use built-in functions and math functions within the same query expression.
Example 2
Getting all the records where ‘square root of unit price is less than half of the shipping cost’ is so easy that, you can get it with —
df.query("sqrt(UnitPrice) < Shipping_Cost/2")
In this way, you can filter the dataset using multiple conditions with different complexity levels. And that too using a simple query. ✅
All the query examples which you observed so far, were about numerical and text columns. However, the usage of query() is not limited to only these data types.
Often you need to filter the DataFrame based on Date-time values. And query() function is so flexible, that you can easily filter the dataset based on date and time values, as you can explore in the below section.
The only requirement for using query() function to filter DataFrame on date-time values is, the column containing these values should be of data type datetime64[ns]
✅
In our example DataSet, the column OrderDate contains Date-time values, but it is parsed as String values. You can easily convert this column into required data type using —
df["OrderDate"] = pd.to_datetime(df["OrderDate"], format="%Y-%m-%d")
Now, you are all set to filter the DataFrame using date-time column in query()
. To extract useful information about dates and to use them in query() function, dt
accessor is quite handy.
📌 dt is a accessor object which is used to extract Date-time like properties of datetime series.
Let’s explore in what different ways you can filter the dataset based on Date-time values.
Example 1
Suppose you want to get all the records where order date is in August. All you need to do is —
df.query("OrderDate.dt.month == 8")
As you can see, all the records have OrderDate in August. And OrderDate.dt.month
shows how you can use dt
accessor to extract only month out of entire date value.
Filtering even further, suppose you want to get all the orders in August 2021 where order day is 15 or more. You can achieve it with —
df.query("OrderDate.dt.month == 8 and OrderDate.dt.year == 2021 and OrderDate.dt.day >=15")
Although, this is the classic example of dt
accessor and combining multiple conditions on same column, the expression seems to be too long and hence not a good practice.
It is written only to demonstrate, how to use dt
operator to extract different parts of date and how to combine multiple conditions.
However, you can get exactly same output by writing extremely simple expression as —
df.query("OrderDate >= '2021-08-15' and OrderDate <= '2021-08-31'")
Moreover, you can also combine the condition on Date-time column and a condition on any other column in single expression. 💯
Example 2
For an instance, retrieving all the records where order date as previous example and Status Delivered, is super-easy with query expression as —
df.query("OrderDate >= '2021-08-15' and OrderDate <= '2021-08-31' and Status == 'Delivered'")
It returned all the records where the expression is evaluated True. Again this just an example to demonstrate how you can combine Date-time and Text column conditions in single query.
So far you specified only the expression to be evaluated in the query() function. However, this function also takes another optional parameter — inplace
In all the examples you see that df.query() generates a new DataFrame. And it is due to the fact that, the 2nd parameter of query() — inplace
— is set to False by default.
So, even after filtering DataFrame on multiple conditions, if you check what is the size of DataFrame, it will show the original size of 9999 x 12
df.shape# output
(9999, 12)
So, with inplace=False
, the query did not modify the original data set. When you want to make the changes in original DataFrame, the simply make inplace=True
.
🚨 But be careful with inplace=True
, as it overwrites the original DataFrame. So there is no chance to get back the original DataFrame once the query with this option is executed.
That’s all about filtering the data points using query()!
I hope you found this article super-useful, refreshing and you learned some really cool tricks to filter pandas DataFrame. I am certain that, after this read you can use pandas query() function more frequently and fluently.
I am using Python for Data Analysis since past 4+ years and found query() function most handy to filter the dataset. Most of these tricks, I use in my work everyday.
Interested in reading more stories on Medium??
💡 Consider Becoming a Medium Member to access unlimited stories on medium and daily interesting Medium digest. I will get a small portion of your fee and No additional cost to you.
💡 Be sure to Sign-up to my Email list to never miss another article on data science guides, tricks and tips, SQL and Python.
💡 Here is a complete Notebook with all the examples.
Thank you for reading!