Tutorial: Everything You Need to Know to Filter Data with Pandas DataFrames (2022) – Dataquest (2023)

February 24, 2022

Tutorial: Everything You Need to Know to Filter Data with Pandas DataFrames (2022) – Dataquest (1)

The Pandas library is a fast, powerful, and easy-to-use tool for working with data. It helps us cleanse, explore, analyze, and visualize data by providing game-changing capabilities. Having data as a Pandas DataFrame allows us to slice and dice data in various ways and filter the DataFrame’s rows effortlessly. This tutorial will go over the primary methods for getting data out of a DataFrame using different filtering techniques.

What is a DataFrame?

Before discussing different filtering DataFrames techniques, let’s review a DataFrame. Technically a DataFrame builds upon the Pandas Series object that a Series is a one-dimensional array of indexed data. Intuitively a DataFrame is similar to a spreadsheet in many ways; it can have multiple columns with varying types of data and rows labeled with row indices.

Creating a DataFrame

First let’s create a dummy DataFrame containing the personal details of a company’s employees using the following snippet:

import pandas as pdimport numpy as npfrom IPython.display import displaydata = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan','Kaycee Acosta', 'Henri Conroy', 'Emma Peralta','Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'], 'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR','HR','Data Science','Data Science', 'Accounting', 'Data Science'], 'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012], 'Gender': ['M','F','F','F','M','F','M','M','M','F'], 'DoB':['04/09/1982', '14/04/1981','06/05/1997','08/01/1986','10/10/1988','12/11/1992', '10/04/1991','16/07/1995','08/10/1992','11/10/1979'], 'Weight':[78,80,66,67,90,57,115,87,95,57], 'Height':[176,160,169,157,185,164,195,180,174,165] })display(data)
EmployeeNameDepartmentHireDateGenderDoBWeightHeight
0Callen DunkleyAccounting2010M04/09/198278176
1Sarah RaynerEngineering2018F14/04/198180160
2Jeanette SloanEngineering2012F06/05/199766169
3Kaycee AcostaHR2014F08/01/198667157
4Henri ConroyHR2014M10/10/198890185
5Emma PeraltaHR2018F12/11/199257164
6Martin ButtData Science2020M10/04/1991115195
7Alex JensenData Science2018M16/07/199587180
8Kim HowarthAccounting2020M08/10/199295174
9Jane BurnettData Science2012F11/10/197957165

Run the code above. It creates a DataFrame with the above data.
Let’s take a look at the columns’ data types by running the following statement:

print(data.dtypes)
EmployeeName objectDepartment objectHireDate int64Gender objectDoB objectWeight int64Height int64dtype: object

The Pandas data type for string values is object that is why the data type of columns like EmployeeName or Department containing strings is object. The DataFrame’s columns each have a specific data type; in other words, the DataFrame’s columns don’t share the same data type.
The other useful DataFrame property is .values, which returns an array of lists, each list representing a DataFrame’s row.

print(data.values)
[['Callen Dunkley' 'Accounting' 2010 'M' '04/09/1982' 78 176] ['Sarah Rayner' 'Engineering' 2018 'F' '14/04/1981' 80 160] ['Jeanette Sloan' 'Engineering' 2012 'F' '06/05/1997' 66 169] ['Kaycee Acosta' 'HR' 2014 'F' '08/01/1986' 67 157] ['Henri Conroy' 'HR' 2014 'M' '10/10/1988' 90 185] ['Emma Peralta' 'HR' 2018 'F' '12/11/1992' 57 164] ['Martin Butt' 'Data Science' 2020 'M' '10/04/1991' 115 195] ['Alex Jensen' 'Data Science' 2018 'M' '16/07/1995' 87 180] ['Kim Howarth' 'Accounting' 2020 'M' '08/10/1992' 95 174] ['Jane Burnett' 'Data Science' 2012 'F' '11/10/1979' 57 165]]

If you’re interested to know the number of rows and columns of the DataFrame, you can use the .shape property as follows:

print(data.shape)
(10, 7)

The tuple above represents the number of rows and columns, respectively.

Slicing and Dicing a DataFrame from Every Possible Angle

So far, we’ve created and inspected a DataFrame. As mentioned earlier, there are different ways for filtering a DataFrame. In this section, we will discuss the most prevalent and efficient methods. However, it would be better to look at how we can select specific columns, first.
To grab the entire of specific columns, we can use the following syntax:

DataFrame.ColumnName # if the column name does not contain any spacesOR DataFrame["ColumnName"]

For example, let’s grab the Department column of the DataFrame.

display(data.Department)
0 Accounting1 Engineering2 Engineering3 HR4 HR5 HR6 Data Science7 Data Science8 Accounting9 Data ScienceName: Department, dtype: object

Running the statement above returns the Series containing the Department column’s data.

Also, we can select multiple columns by passing a list of column names, as follows:

(Video) How To Apply Functions To DataFrames - Pandas For Machine Learning 15

display(data[['EmployeeName','Department','Gender']])
EmployeeNameDepartmentGender
0Callen DunkleyAccountingM
1Sarah RaynerEngineeringF
2Jeanette SloanEngineeringF
3Kaycee AcostaHRF
4Henri ConroyHRM
5Emma PeraltaHRF
6Martin ButtData ScienceM
7Alex JensenData ScienceM
8Kim HowarthAccountingM
9Jane BurnettData ScienceF

Running the statement above gives us a DataFrame that is a subset of the original DataFrame.

The .loc and .iloc Methods

We can use slicing techniques to extract specific rows from a DataFrame. The best way to slice a DataFrame and select particular columns is to use the .loc and .iloc methods. The twin methods create a subset of a DataFrame using label-based or integer-based indexing, respectively.

DataFrame.loc[row_indexer,column_indexer]DataFrame.iloc[row_indexer,column_indexer]

The first position in both methods specifies the row indexer, and the second specifies the column indexer separated with a comma.
Before we want to practice working with the .loc and .iloc methods, let’s set the EmployeeName column as the index of the DataFrame. Doing this is not required but helps us discuss all the capabilities of these methods. To do so, use the set_index() method to make the EmployeeName column the index of the DataFrame:

data = data.set_index('EmployeeName')display(data)
DepartmentHireDateGenderDoBWeightHeight
EmployeeName
Callen DunkleyAccounting2010M04/09/198278176
Sarah RaynerEngineering2018F14/04/198180160
Jeanette SloanEngineering2012F06/05/199766169
Kaycee AcostaHR2014F08/01/198667157
Henri ConroyHR2014M10/10/198890185
Emma PeraltaHR2018F12/11/199257164
Martin ButtData Science2020M10/04/1991115195
Alex JensenData Science2018M16/07/199587180
Kim HowarthAccounting2020M08/10/199295174
Jane BurnettData Science2012F11/10/197957165

Now, instead of having integer-based indexes, we have label-based indexes, which makes it more apt to try all the capabilities of .loc and .iloc methods for slicing and dicing the DataFrame.

NOTE

The .loc method is label-based, so you have to specify rows and columns based on their labels. On the other hand, the iloc method is integer index-based, so you have to select rows and columns by their integer index.

The following statements return the identical results practically:

display(data.loc[['Henri Conroy', 'Kim Howarth']])display(data.iloc[[4,8]])
DepartmentHireDateGenderDoBWeightHeight
EmployeeName
Henri ConroyHR2014M10/10/198890185
Kim HowarthAccounting2020M08/10/199295174
DepartmentHireDateGenderDoBWeightHeight
EmployeeName
Henri ConroyHR2014M10/10/198890185
Kim HowarthAccounting2020M08/10/199295174
(Video) Build A Custom Search Engine In Python With Filtering

In the above code, the .loc method gets a list of the employee names as row labels and returns a DataFrame containing the rows associated with the two employees.
The .iloc method does the same thing, but instead of getting the labeled indexes, it gets the integer indexes of the rows.
Now, let’s extract a sub-DataFrame that contains the name, department, and gender values of the first three employees using the .loc and .iloc methods:

display(data.iloc[:3,[0,2]])display(data.loc['Callen Dunkley':'Jeanette Sloan',['Department','Gender']])
DepartmentGender
EmployeeName
Callen DunkleyAccountingM
Sarah RaynerEngineeringF
Jeanette SloanEngineeringF
DepartmentGender
EmployeeName
Callen DunkleyAccountingM
Sarah RaynerEngineeringF
Jeanette SloanEngineeringF

Running the statements above returns the identical ​results as follows:

EmployeeName Department Gender Callen Dunkley Accounting MSarah Rayner Engineering FJeanette Sloan Engineering F

The .loc and .iloc methods are pretty similar; the only difference is how referring to columns and rows in a DataFrame.
The following statements return all the rows of the DataFrame:

display(data.iloc[:,[0,2]])display(data.loc[:,['Department','Gender']])
DepartmentGender
EmployeeName
Callen DunkleyAccountingM
Sarah RaynerEngineeringF
Jeanette SloanEngineeringF
Kaycee AcostaHRF
Henri ConroyHRM
Emma PeraltaHRF
Martin ButtData ScienceM
Alex JensenData ScienceM
Kim HowarthAccountingM
Jane BurnettData ScienceF
DepartmentGender
EmployeeName
Callen DunkleyAccountingM
Sarah RaynerEngineeringF
Jeanette SloanEngineeringF
Kaycee AcostaHRF
Henri ConroyHRM
Emma PeraltaHRF
Martin ButtData ScienceM
Alex JensenData ScienceM
Kim HowarthAccountingM
Jane BurnettData ScienceF

We’re able to do slicing for columns too. The statement below returns the Weight and Height columns of every second row of the DataFrame:

display(data.loc[::2,'Weight':'Height'])display(data.iloc[::2,4:6])
WeightHeight
EmployeeName
Callen Dunkley78176
Jeanette Sloan66169
Henri Conroy90185
Martin Butt115195
Kim Howarth95174
(Video) Cleaning NBA Stats Data With Python And Pandas: Data Project [part 2 of 3]
WeightHeight
EmployeeName
Callen Dunkley78176
Jeanette Sloan66169
Henri Conroy90185
Martin Butt115195
Kim Howarth95174

We perform slicing on the DataFrame’s rows and columns in the code above. The syntax is similar to slicing a list in Python. We need to specify the start index, the stop index, and the step size.

EmployeeName Weight Height Callen Dunkley 78 176Jeanette Sloan 66 169Henri Conroy 90 185Martin Butt 115 195Kim Howarth 95 174

Notice that in slicing with labels using the .loc method, both the start and the stop labels are included. However, in slicing using .iloc method the stop index is excluded.

Filtering a DataFrame in Pandas

So far, we’ve learned how to slice and dice a DataFrame, but how can we take data that meets specific criteria?
Using Boolean masks is the common method for filtering a DataFrame in Pandas. First let’s see what is a Boolean mask:

print(data.HireDate > 2015)
EmployeeNameCallen Dunkley FalseSarah Rayner TrueJeanette Sloan FalseKaycee Acosta FalseHenri Conroy FalseEmma Peralta TrueMartin Butt TrueAlex Jensen TrueKim Howarth TrueJane Burnett FalseName: HireDate, dtype: bool

By running the code above, we can see which entries in the HireDate column are greater than 2015. The Boolean values indicate whether the condition is met or not.
Now, we can use the boolean mask to get the subset of the DataFrame containing the
employees who are hired after 2015:

display(data[data.HireDate > 2015])
DepartmentHireDateGenderDoBWeightHeight
EmployeeName
Sarah RaynerEngineering2018F14/04/198180160
Emma PeraltaHR2018F12/11/199257164
Martin ButtData Science2020M10/04/1991115195
Alex JensenData Science2018M16/07/199587180
Kim HowarthAccounting2020M08/10/199295174

We also can use the .loc method to filter the rows along with returning certain columns, as follows:

display(data.loc[data.HireDate > 2015, ['Department', 'HireDate']])
DepartmentHireDate
EmployeeName
Sarah RaynerEngineering2018
Emma PeraltaHR2018
Martin ButtData Science2020
Alex JensenData Science2018
Kim HowarthAccounting2020

We’re able to combine two or more criteria using logical operators. Let’s list those male employees working in Data Science team:

display(data.loc[(data.Gender == "M") & (data.Department=='Data Science'), ['Department', 'Gender', 'HireDate']] )
DepartmentGenderHireDate
EmployeeName
Martin ButtData ScienceM2020
Alex JensenData ScienceM2018
(Video) How to use groupby() to group categories in a pandas DataFrame

What if we want to get all employees with heights between 170 cm and 190 cm?

display(data.loc[data.Height.between(170,190), ['Department', 'Gender', 'Height']] )
DepartmentGenderHeight
EmployeeName
Callen DunkleyAccountingM176
Henri ConroyHRM185
Alex JensenData ScienceM180
Kim HowarthAccountingM174

In the between() method both ends of the range are inclusive.
The other helpful method is isin(), which creates a Boolean mask for values contained in a list of specified values. Let’s try the method to retrieve the employees who are working in either Human Resources or Accounting departments:

display(data.loc[data.Department.isin(['HR', 'Accounting']),['Department', 'Gender', 'HireDate']])
DepartmentGenderHireDate
EmployeeName
Callen DunkleyAccountingM2010
Kaycee AcostaHRF2014
Henri ConroyHRM2014
Emma PeraltaHRF2018
Kim HowarthAccountingM2020

To filter a DataFrame, we can use regular expressions to search for a pattern rather than the exact content. The following example shows how we can use a regular expression to get those employees whose last names end with either ‘tt’ or ‘th’:

display(data.loc[data.index.str.contains(r'tt$|th$'), ['Department', 'Gender']])
DepartmentGender
EmployeeName
Martin ButtData ScienceM
Kim HowarthAccountingM
Jane BurnettData ScienceF

Conclusion

In this tutorial, we discussed different ways of slicing and dicing a DataFrame and filtering rows with specific criteria. we addressed how to take subsets of data through selecting columns, slicing rows, and filtering based on conditions, methods or regular expressions.

PandasTutorials

You May Also Like

Do You Post Too Much? Analyze Your Personal Facebook Data with PythonREAD MORE
Tutorial: Understanding Regression Error Metrics in PythonREAD MORE
(Video) Predict Football Match Winners With Machine Learning And Python

Videos

1. What is Pandas? Why and How to Use Pandas in Python
(Python Programmer)
2. How to combine DataFrames in Pandas | Merge, Join, Concat, & Append
(Chart Explorers)
3. Predict The Stock Market With Machine Learning And Python
(Dataquest)
4. LEARN PANDAS in about 10 minutes! A great python module for Data Science!
(Python Programmer)
5. Web Scraping Football Matches From The EPL With Python [part 1 of 2]
(Dataquest)
6. 3 Python Pandas Dataframe | Different ways DF creation [Urdu | Hindi] tutorial
(Data Insights School)
Top Articles
Latest Posts
Article information

Author: Duncan Muller

Last Updated: 03/11/2023

Views: 6227

Rating: 4.9 / 5 (59 voted)

Reviews: 82% of readers found this page helpful

Author information

Name: Duncan Muller

Birthday: 1997-01-13

Address: Apt. 505 914 Phillip Crossroad, O'Konborough, NV 62411

Phone: +8555305800947

Job: Construction Agent

Hobby: Shopping, Table tennis, Snowboarding, Rafting, Motor sports, Homebrewing, Taxidermy

Introduction: My name is Duncan Muller, I am a enchanting, good, gentle, modern, tasty, nice, elegant person who loves writing and wants to share my knowledge and understanding with you.