GroupBy in Pandas Python (Latest 2022)

groupby pandas python code
GroupBy Python Code

This tutorial covers one of the most used functions while performing data analysis or munging tasks. Let’s explore GroupBy in python pandas with code snippets and examples.

“Group by” operation involves one or more of the following steps:

  • Splitting the data into groups based on some criteria.
  • Applying one or more functions to each group independently.
  • Combining the results into a data frame/data structure.

Table of Contents

  1. GroupBy on Single Column (Summary)
  2. GroupBy on Multiple Columns (NEW: Tuple, NamedAggregation)
  3. Operations on Groups

First, let’s create a DataFrame of having person first_name, degree, nationality, age, and salary.

# Import modules
import pandas as pd
import numpy as np
import random

# Create a dataframe
raw_data = {'first_name': ['Sam','Ziva','Kia','Robin','Kim','Ram','Zen','John','Ariel','Ron'], 
         'degree': ['PhD','MBA','PhD','MS','MBA','PhD','PhD','MS','PhD','PhD'],
         'nationality': ['USA','India','India','USA','India','UK','USA','USA','India','UK'],
         'age': [25, 29, 19, 21, 33, 38, 24, 26, 32, 29]}
df = pd.DataFrame(raw_data)

# Randomly generate Salary between 80K to 150K having steps (multiple) of 10K
df['salary']=0
df['salary']=df['salary'].apply(lambda x: random.randrange(80000,150000,10000))
 
df
first_namedegreenationalityagesalary
0SamPhDUSA2590000
1ZivaMBAIndia2980000
2KiaPhDIndia19110000
3RobinMSUSA21110000
4KimMBAIndia33110000
5RamPhDUK3880000
6ZenPhDUSA24130000
7JohnMSUSA26130000
8ArielPhDIndia3290000
9RonPhDUK29120000

If you have noticed, the salary is randomly generated between 80K to 150K having steps (multiple) of 10K using randrange function.

1. GroupBy on Single Column Python Pandas

Let’s find out the average salary of each degree type. GroupBy can be used to summarize degree column can using THREE methods listed below:

# METHOD 1: using groupby on degree column to find average salary
df.groupby('degree', as_index=False)['salary'].mean()

# OR METHOD 2:
#df.groupby('degree')['salary'].mean().reset_index()

# OR METHOD 3:
#df.groupby('degree', as_index=False).agg({"salary": "mean"})
degreesalary
0MBA125000.000000
1MS115000.000000
2PhD116666.666667

TIP: The group by output will have an index or multiple indexes (headers) corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” or “.reset_index()” to the groupby operation.

We have seen the application of the mean (average) function above. Pandas already provide a full range of groupby aggregation functions list built into the base Pandas package:

FunctionDescription
countNumber of non-null observations
sumSum of values
meanMean of values
madMean absolute deviation
medianArithmetic median of values
………so on

Pandas Data Aggregation: Find GroupBy Count

Let’s look into the application of the .count() function. Count function is used to counts the occurrences of values in each group. In the case of the degree column, count each type of degree present.

# sort data by degree just for visualization (can skip this step)
df.sort_values(by='degree')
Python Pandas Aggregation GroupBy Count
Python| Pandas GroupBy Count Example
# Method 1: count each type of degree 
df.groupby('degree', as_index=False)['salary'].count()
degreesalary
0MBA2
1MS2
2PhD6

2. GroupBy on Multiple Columns Python Pandas

Group By on two or more columns is possible and easy using Pandas.

a. Apply Single Functions on Columns

#groupby on nationality & degree, taking max of age and summation of salary per group
df.groupby(
   ['nationality','degree']
).agg(
    {
         'salary':"sum",    # sum of salary per group
         'age': "max"  # max of age per group
    }
).reset_index()
nationalitydegreesalaryage
0IndiaMBA19000033
1IndiaPhD20000032
2UKPhD20000038
3USAMS24000026
4USAPhD22000025

For Nationality India and degree MBA, the maximum age is 33.

b. Apply Multiple Functions on Columns

Multiple functions can be applied to a single column. In the above example, we can show both the minimum and maximum value of the age column.

Pandas Tuple Aggregations (Recommended):

Introduced in Pandas 0.25.0, Pandas has added new groupby behavior “named aggregation” and tuples, for naming the output columns when applying multiple aggregation functions to specific columns.

df.groupby(
     ['nationality','degree']
 ).agg(
     mean_salary = ('salary','mean'),
     min_age     = ('age',   'min'),
     max_age     = ('age',   'max')
 ).reset_index()
nationalitydegreemean_salarymin_agemax_age
0IndiaMBA1100002933
1IndiaPhD1100001932
2UKPhD1350002938
3USAMS1150002126
4USAPhD1150002425

Pandas Named Aggregation:

For clearer naming, Pandas also provides the NamedAggregation named-tuple, which can be used to achieve the same as normal tuples:

df.groupby(
    ['nationality','degree']
).agg(
    mean_salary = pd.NamedAgg(column='salary', aggfunc='mean'),
    min_age     = pd.NamedAgg(column='age',    aggfunc='min'),
    max_age     = pd.NamedAgg(column='age',    aggfunc='max')
).reset_index()

This will give the same output as the above table.

c. Rename columns name in Groupby Python Pandas (Automatically)

Sometimes there are a lot of columns in the data frame. So, manually renaming each row while performing group by is a cumbersome task. This step can be automated using str. Let’s check the example below:

# automatically rename age column according to the functions performed on them in groupby
grp = df.groupby('degree').agg({'age': ['min', 'max', 'mean']})
grp.columns = grp.columns.droplevel(level=0)
grp.columns = ['age_'+str(col) for col in grp.columns]
grp
age_minage_maxage_mean
degree   
MBA293331
MS212623.5
PhD193827.83333

3. Operations on Groups

a. View Groups

The DataFrame df, seems jumbled. So to have a clearer view, we have two options either sort the data or view by groups. Let’s structure data using groups.

grp=df.groupby(['nationality','degree'])
#print the first value in each group 
grp.first()
first_nameagesalary
nationalitydegree   
IndiaMBAZiva29110000
 PhDKia19140000
UKPhDRam38140000
USAMSRobin21130000
 PhDSam2590000

b. Select Group

Using the get_group() function, we can select a single group. For example, degree has 3 groups: MS, MBA and PhD. Let’s select all values falls under MS group.

#groupby degree
grp=df.groupby('degree')
#select the degree values present in the "MS" group 
#grp.get_group('MS') 
first_namedegreenationalityagesalary
3RobinMSUSA21130000
7JohnMSUSA26100000

Conclusion

Groupby is also one of the most frequent functions used while data analysis. In this tutorial, we have mastered groupby aggregation and summarization which is really straightforward in python pandas. For official documentation, click here.

Leave a Comment

Keytodatascience Logo

Connect

Subscribe

Join our email list to receive the latest updates.

© 2022 KeyToDataScience