This tutorial covers one of the most used functions while performing data analysis or munging tasks. Let’s explore GroupBy in python pandas with code snippets and examples.
“Group by” operation involves one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying one or more functions to each group independently.
- Combining the results into a data frame/data structure.
Table of Contents
Table of Contents
- GroupBy on Single Column (Summary)
- GroupBy on Multiple Columns (NEW: Tuple, NamedAggregation)
- Operations on Groups
First, let’s create a DataFrame of having person first_name, degree, nationality, age, and salary.
# Import modules
import pandas as pd
import numpy as np
import random
# Create a dataframe
raw_data = {'first_name': ['Sam','Ziva','Kia','Robin','Kim','Ram','Zen','John','Ariel','Ron'],
'degree': ['PhD','MBA','PhD','MS','MBA','PhD','PhD','MS','PhD','PhD'],
'nationality': ['USA','India','India','USA','India','UK','USA','USA','India','UK'],
'age': [25, 29, 19, 21, 33, 38, 24, 26, 32, 29]}
df = pd.DataFrame(raw_data)
# Randomly generate Salary between 80K to 150K having steps (multiple) of 10K
df['salary']=0
df['salary']=df['salary'].apply(lambda x: random.randrange(80000,150000,10000))
df
first_name | degree | nationality | age | salary | |
---|---|---|---|---|---|
0 | Sam | PhD | USA | 25 | 90000 |
1 | Ziva | MBA | India | 29 | 80000 |
2 | Kia | PhD | India | 19 | 110000 |
3 | Robin | MS | USA | 21 | 110000 |
4 | Kim | MBA | India | 33 | 110000 |
5 | Ram | PhD | UK | 38 | 80000 |
6 | Zen | PhD | USA | 24 | 130000 |
7 | John | MS | USA | 26 | 130000 |
8 | Ariel | PhD | India | 32 | 90000 |
9 | Ron | PhD | UK | 29 | 120000 |
If you have noticed, the salary is randomly generated between 80K to 150K having steps (multiple) of 10K using randrange function.
1. GroupBy on Single Column Python Pandas
Let’s find out the average salary of each degree type. GroupBy can be used to summarize degree column can using THREE methods listed below:
# METHOD 1: using groupby on degree column to find average salary
df.groupby('degree', as_index=False)['salary'].mean()
# OR METHOD 2:
#df.groupby('degree')['salary'].mean().reset_index()
# OR METHOD 3:
#df.groupby('degree', as_index=False).agg({"salary": "mean"})
degree | salary | |
---|---|---|
0 | MBA | 125000.000000 |
1 | MS | 115000.000000 |
2 | PhD | 116666.666667 |
TIP: The group by output will have an index or multiple indexes (headers) corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” or “.reset_index()” to the groupby operation.
We have seen the application of the mean (average) function above. Pandas already provide a full range of groupby aggregation functions list built into the base Pandas package:
Function | Description |
---|---|
count | Number of non-null observations |
sum | Sum of values |
mean | Mean of values |
mad | Mean absolute deviation |
median | Arithmetic median of values |
………so on |
Pandas Data Aggregation: Find GroupBy Count
Let’s look into the application of the .count()
function. Count function is used to counts the occurrences of values in each group. In the case of the degree
column, count each type of degree present.
# sort data by degree just for visualization (can skip this step)
df.sort_values(by='degree')
# Method 1: count each type of degree
df.groupby('degree', as_index=False)['salary'].count()
degree | salary | |
0 | MBA | 2 |
1 | MS | 2 |
2 | PhD | 6 |
2. GroupBy on Multiple Columns Python Pandas
Group By on two or more columns is possible and easy using Pandas.
a. Apply Single Functions on Columns
#groupby on nationality & degree, taking max of age and summation of salary per group
df.groupby(
['nationality','degree']
).agg(
{
'salary':"sum", # sum of salary per group
'age': "max" # max of age per group
}
).reset_index()
nationality | degree | salary | age | |
---|---|---|---|---|
0 | India | MBA | 190000 | 33 |
1 | India | PhD | 200000 | 32 |
2 | UK | PhD | 200000 | 38 |
3 | USA | MS | 240000 | 26 |
4 | USA | PhD | 220000 | 25 |
For Nationality India and degree MBA, the maximum age is 33.
b. Apply Multiple Functions on Columns
Multiple functions can be applied to a single column. In the above example, we can show both the minimum and maximum value of the age column.
Pandas Tuple Aggregations (Recommended):
Introduced in Pandas 0.25.0, Pandas has added new groupby behavior “named aggregation” and tuples, for naming the output columns when applying multiple aggregation functions to specific columns.
df.groupby(
['nationality','degree']
).agg(
mean_salary = ('salary','mean'),
min_age = ('age', 'min'),
max_age = ('age', 'max')
).reset_index()
nationality | degree | mean_salary | min_age | max_age | |
---|---|---|---|---|---|
0 | India | MBA | 110000 | 29 | 33 |
1 | India | PhD | 110000 | 19 | 32 |
2 | UK | PhD | 135000 | 29 | 38 |
3 | USA | MS | 115000 | 21 | 26 |
4 | USA | PhD | 115000 | 24 | 25 |
Pandas Named Aggregation:
For clearer naming, Pandas also provides the NamedAggregation named-tuple, which can be used to achieve the same as normal tuples:
df.groupby(
['nationality','degree']
).agg(
mean_salary = pd.NamedAgg(column='salary', aggfunc='mean'),
min_age = pd.NamedAgg(column='age', aggfunc='min'),
max_age = pd.NamedAgg(column='age', aggfunc='max')
).reset_index()
This will give the same output as the above table.
c. Rename columns name in Groupby Python Pandas (Automatically)
Sometimes there are a lot of columns in the data frame. So, manually renaming each row while performing group by is a cumbersome task. This step can be automated using str. Let’s check the example below:
# automatically rename age column according to the functions performed on them in groupby
grp = df.groupby('degree').agg({'age': ['min', 'max', 'mean']})
grp.columns = grp.columns.droplevel(level=0)
grp.columns = ['age_'+str(col) for col in grp.columns]
grp
age_min | age_max | age_mean | |
---|---|---|---|
degree | |||
MBA | 29 | 33 | 31 |
MS | 21 | 26 | 23.5 |
PhD | 19 | 38 | 27.83333 |
3. Operations on Groups
a. View Groups
The DataFrame df, seems jumbled. So to have a clearer view, we have two options either sort the data or view by groups. Let’s structure data using groups.
grp=df.groupby(['nationality','degree'])
#print the first value in each group
grp.first()
first_name | age | salary | ||
---|---|---|---|---|
nationality | degree | |||
India | MBA | Ziva | 29 | 110000 |
PhD | Kia | 19 | 140000 | |
UK | PhD | Ram | 38 | 140000 |
USA | MS | Robin | 21 | 130000 |
PhD | Sam | 25 | 90000 |
b. Select Group
Using the get_group() function, we can select a single group. For example, degree has 3 groups: MS, MBA and PhD. Let’s select all values falls under MS group.
#groupby degree
grp=df.groupby('degree')
#select the degree values present in the "MS" group
#grp.get_group('MS')
first_name | degree | nationality | age | salary | |
---|---|---|---|---|---|
3 | Robin | MS | USA | 21 | 130000 |
7 | John | MS | USA | 26 | 100000 |
Conclusion
Groupby is also one of the most frequent functions used while data analysis. In this tutorial, we have mastered groupby aggregation and summarization which is really straightforward in python pandas. For official documentation, click here.