Selecting Rows and Columns Based on Conditions in Python Pandas DataFrame

There are multiple instances where we have to select the rows and columns from a Pandas DataFrame by multiple conditions. Let’s see a few commonly used approaches to filter rows or columns of a dataframe using the indexing and selection in multiple ways. For example, one can use label based indexing with loc function.

 

Table of Contents:

  • Select data by multiple conditions (Boolean Variables)
  • Select data by conditional statement (.loc)
    • Set values for selected subset data in DataFrame
  • Select data by row numbers (.iloc)
  • Select all rows containing a substring

Data Preparation

#Import modules
import pandas as pd
import numpy as np

#Create a dataframe
raw_data = {'first_name': ['Sam', 'Ziva', 'Kia', 'Robin', 'Kim'], 
         'degree': ["MBA", "MS", "Graduate", "Arts", "MS"],
         'nationality': ["USA", "India", "UK", "France", "Canada"],
         'age': [25, 29, 19, 21, 33],
         'grade':['A+',  'A', 'C', np.nan, 'B-']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'degree','nationality',
         'age','grade'])
df
 first_namedegreenationalityagegrade
0SamMBAUSA25A+
1ZivaMSIndia29A
2KiaGraduateUK19C
3RobinArtsFrance21NaN
4KimMSCanada33B-

 

Select data using Boolean Variables

Select rows or columns based on conditions in Pandas DataFrame using different operators. First, let’s check operators to select rows based on particular column value using '>', '=', '=', '<=', '!=' operators.

#Select rows where age is greater than 28
df[df['age'] > 28]
 first_namedegreenationalityagegrade
1ZivaMSIndia29A
4KimMSCanada33B-
#Select all cases where age is greater than 28 and grade is 'A'
df[(df['age'] > 28) & (df['grade'] == 'A')]
 first_namedegreenationalityagegrade
1ZivaMSIndia29A
#Select the degree cell where age is greater than 28 and grade is 'A'
df[(df['age'] > 28) & (df['grade'] == 'A')]["degree"]
#another way of writing the same thing
#df[(df['age'] > 28) & (df['grade'] == 'A')].degree
1    MS
Name: degree, dtype: object

Select data using “loc”

The syntax of the “loc” indexer is: data.loc[<row selection>, <column selection>].

Let’s repeat all the previous examples using loc indexer. In the next section we will compare the differences between the two.

#Select rows where age is greater than 28
df.loc[df['age'] > 28]
 first_namedegreenationalityagegrade
1ZivaMSIndia29A
4KimMSCanada33B-
#Select all cases where age is greater than 28 and grade is 'A' 
df.loc[(df['age'] > 28) & (df['grade'] == 'A')]
 first_namedegreenationalityagegrade
1ZivaMSIndia29A
#Select the degree cell where age is greater than 28 and grade is 'A'
df.loc[(df['age'] > 28) & (df['grade'] == 'A'),"degree"]
1    MS
Name: degree, dtype: object

Set values for selected subset data in DataFrame

Using “.loc”, DataFrame update can be done in the same statement of selection and filter with a slight change in syntax. You can update values in columns applying different conditions.

For example, we will update the degree of persons whose age is greater than 28 to “PhD”.

#Select the rows where age is greater than 28  
df.loc[df['age'] > 28, "degree"] = "PhD" 
first_namedegreenationalityagegrade
1ZivaMSIndia29A
4KimMSCanada33B-
#Update the degree to "PhD" for the selected persons
df.loc[df['age'] > 28, "degree"] = "PhD"
first_namedegreenationalityagegrade
1ZivaPhDIndia29A
4KimPhDCanada33B-

However, boolean operations do not work in case of updating DataFrame values.

#This statement will not update degree to "PhD" for the selected rows
df[df['age'] > 28].degree = "PhD" 

Select data using “iloc”

The iloc syntax is data.iloc[<row selection>, <column selection>]. “iloc” in pandas is used to select rows and columns by number, in the order that they appear in the DataFrame. Both row and column numbers start from 0 in python.

i. Single Selection

#rows
df.iloc[0] # first row of data frame (Sam)
df.iloc[-1] # last row of data frame (Kim)
#columns
df.iloc[:,0] # first column of data frame (first_name)
df.iloc[:,-1] # last column of data frame (grade)

ii. Multiple Selection

#Multiple row and column selections using iloc and DataFrame
df.iloc[0:3] # first three rows of dataframe
df.iloc[:, 0:2] # first two columns of data frame with all rows
df.iloc[[0,3], [0,2]] # 1st, 4th row and 1st, 3rd columns
df.iloc[0:2, 1:4] # first 2 rows and 2nd, 3rd, 4th columns of data frame (degree-age).

Select all rows containing a sub string

Select rows in DataFrame which contain the substring. We will use str.contains() function

#Select rows which contain "duate" substring while ignoring the case
df[df.degree.str.contains('duate',case=False)]
first_namedegreenationalityagegrade
2KiaGraduateUK19NaN

Conclusion

We have covered the basics of indexing and selecting with Pandas. These Pandas functions are an essential part of any data munging task and will not throw an error if any of the values are empty or null or NaN. There are other useful functions that you can check in the official documentation.

Start with Data Science here

Leave a Comment