There are multiple instances where we have to select the rows and columns from a Pandas DataFrame by multiple conditions. Let’s see a few commonly used approaches to filter rows or columns of a dataframe using the indexing and selection in multiple ways. For example, one can use label based indexing with loc function.
Table of Contents:
- Select data by multiple conditions (Boolean Variables)
- Select data by conditional statement (.loc)
- Set values for selected subset data in DataFrame
- Select data by row numbers (.iloc)
- Select all rows containing a substring
Data Preparation
# import modules
import pandas as pd
import numpy as np
# create a dataframe
raw_data = {'first_name': ['Sam', 'Ziva', 'Kia', 'Robin', 'Kim'],
'degree': ["MBA", "MS", "Graduate", "Arts", "MS"],
'nationality': ["USA", "India", "UK", "France", "Canada"],
'age': [25, 29, 19, 21, 33],
'grade':['A+', 'A', 'C', np.nan, 'B-']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'degree','nationality',
'age','grade'])
df
first_name | degree | nationality | age | grade | |
0 | Sam | MBA | USA | 25 | A+ |
1 | Ziva | MS | India | 29 | A |
2 | Kia | Graduate | UK | 19 | C |
3 | Robin | Arts | France | 21 | NaN |
4 | Kim | MS | Canada | 33 | B- |
1. Select data using Boolean Variables
Select rows or columns based on conditions in Pandas DataFrame using different operators. First, let’s check operators to select rows based on particular column value using '>', '=', '=', '<=', '!='
operators.
# select rows where age is greater than 28
df[df['age'] > 28]
first_name | degree | nationality | age | grade | |
---|---|---|---|---|---|
1 | Ziva | MS | India | 29 | A |
4 | Kim | MS | Canada | 33 | B- |
# select all cases where age is greater than 28 and grade is 'A'
df[(df['age'] > 28) & (df['grade'] == 'A')]
first_name | degree | nationality | age | grade | |
1 | Ziva | MS | India | 29 | A |
# select the degree cell where age is greater than 28 and grade is 'A'
df[(df['age'] > 28) & (df['grade'] == 'A')]["degree"]
# another way of writing the same thing
#df[(df['age'] > 28) & (df['grade'] == 'A')].degree
1 MS Name: degree, dtype: object
2. Select data using “loc”
The syntax of the “loc” indexer is: data.loc[<row selection>, <column selection>].
Let’s repeat all the previous examples using loc indexer. In the next section we will compare the differences between the two.
# select rows where age is greater than 28
df.loc[df['age'] > 28]
first_name | degree | nationality | age | grade | |
---|---|---|---|---|---|
1 | Ziva | MS | India | 29 | A |
4 | Kim | MS | Canada | 33 | B- |
# select all cases where age is greater than 28 and grade is 'A'
df.loc[(df['age'] > 28) & (df['grade'] == 'A')]
first_name | degree | nationality | age | grade | |
1 | Ziva | MS | India | 29 | A |
# select the degree cell where age is greater than 28 and grade is 'A'
df.loc[(df['age'] > 28) & (df['grade'] == 'A'),"degree"]
1 MS Name: degree, dtype: object
3. Set values for selected subset data in DataFrame
Using “.loc”, DataFrame update can be done in the same statement of selection and filter with a slight change in syntax. You can update values in columns applying different conditions.
For example, we will update the degree of persons whose age is greater than 28 to “PhD”.
# select the rows where age is greater than 28
df.loc[df['age'] > 28, "degree"] = "PhD"
first_name | degree | nationality | age | grade | |
---|---|---|---|---|---|
1 | Ziva | MS | India | 29 | A |
4 | Kim | MS | Canada | 33 | B- |
#Update the degree to "PhD" for the selected persons
df.loc[df['age'] > 28, "degree"] = "PhD"
first_name | degree | nationality | age | grade | |
---|---|---|---|---|---|
1 | Ziva | PhD | India | 29 | A |
4 | Kim | PhD | Canada | 33 | B- |
However, boolean operations do not work in case of updating DataFrame values.
# this statement will not update degree to "PhD" for the selected rows
df[df['age'] > 28].degree = "PhD"
4. Select data using “iloc”
The iloc syntax is data.iloc[<row selection>, <column selection>]. “iloc” in pandas is used to select rows and columns by number, in the order that they appear in the DataFrame. Both row and column numbers start from 0 in python.
a. Single Selection
# rows
df.iloc[0] # first row of data frame (Sam)
df.iloc[-1] # last row of data frame (Kim)
# columns
df.iloc[:,0] # first column of data frame (first_name)
df.iloc[:,-1] # last column of data frame (grade)
b. Multiple Selection
# multiple row and column selections using iloc and DataFrame
df.iloc[0:3] # first three rows of dataframe
df.iloc[:, 0:2] # first two columns of data frame with all rows
df.iloc[[0,3], [0,2]] # 1st, 4th row and 1st, 3rd columns
df.iloc[0:2, 1:4] # first 2 rows and 2nd, 3rd, 4th columns of data frame (degree-age).
5. Select all rows containing a sub string
Select rows in DataFrame which contain the substring. We will use str.contains() function
#Select rows which contain "duate" substring while ignoring the case
df[df.degree.str.contains('duate',case=False)]
first_name | degree | nationality | age | grade | |
---|---|---|---|---|---|
2 | Kia | Graduate | UK | 19 | NaN |
Conclusion
We have covered the basics of indexing and selecting with Pandas. These Pandas functions are an essential part of any data munging task and will not throw an error if any of the values are empty or null or NaN. There are other useful functions that you can check in the official documentation.
Start with Data Science here.
Great post