Boxplot

In descriptive statistics, a box plot or boxplot (also known as a box-and-whisker plot) is a type of chart often used in exploratory data analysis (EDA). Box plots visually show the distribution and skewness of numerical data by displaying the data quartiles (or percentiles) and the median.

A box plot is the visual representation of the statistical five-number summary of a given data set: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
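As a rough illustration, the five-number summary can be computed with Python's standard library. This is a minimal sketch, not the article's own code: the helper name `five_number_summary` is hypothetical, the sample scores are borrowed from the code example later in this article, and `method="inclusive"` is an assumed choice of quartile convention (other conventions give slightly different quartiles). Here the minimum and maximum are the raw extremes, with no outliers excluded.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between neighbouring data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

scores = [1, 3, 5, 7, 9, 11, 20]
summary = five_number_summary(scores)
print(summary)  # (1, 4.0, 7.0, 10.0, 20)
```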

Features of a Box Plot (also called a box and whisker plot)

Definitions

  • Minimum Score: The lowest score, excluding outliers (shown at the end of the left whisker).
  • Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).
  • Median: The median (sometimes known as the second quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less.
  • Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value.
  • Maximum Score: The highest score, excluding outliers (shown at the end of the right whisker).
  • Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores).
  • The Interquartile Range (or IQR): This is the box itself, which spans the middle 50% of scores (i.e., the range between the 25th and 75th percentiles).
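The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `box_features` is a hypothetical helper, quartiles use the standard library's "inclusive" convention, and "excluding outliers" is interpreted with the common 1.5 × IQR rule via the assumed parameter `k`.

```python
import statistics

def box_features(values, k=1.5):
    """Compute the box and whisker positions for a list of numbers.

    k is the IQR multiplier that decides what counts as an outlier
    (1.5 is a common convention and is an assumption here).
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # width of the box: the middle 50% of scores
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # The whiskers end at the most extreme scores that are not outliers
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "minimum": inside[0], "maximum": inside[-1],
    }

features = box_features([1, 3, 5, 7, 9, 11, 20])
print(features)
```

With these sample scores the upper whisker ends at 11 rather than 20, because 20 lies beyond Q3 + 1.5 × IQR = 19.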

Detect Outliers using Boxplot

Box plots are useful as they show outliers within a data set.

An outlier is an observation that is numerically distant from the rest of the data.

When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.

[Figure: box plot with outliers]

For example, a common rule treats a data point as an outlier if it lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile (that is, below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR).
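The 1.5 × IQR rule can be applied directly, without drawing a plot. The sketch below is illustrative only: `find_outliers` is a hypothetical helper, and the "inclusive" quartile method from Python's standard library is an assumed convention, so plotting libraries may place the fences slightly differently on some data sets.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # Q1 - 1.5 * IQR by default
    upper_fence = q3 + k * iqr  # Q3 + 1.5 * IQR by default
    return [x for x in values if x < lower_fence or x > upper_fence]

outliers = find_outliers([1, 3, 5, 7, 9, 11, 20])
print(outliers)  # [20]
```

For this data Q1 = 4 and Q3 = 10, so IQR = 6 and the fences sit at -5 and 19; only 20 falls outside them.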

[Figure: box plots showing the skewness of a data set compared with distribution curves]
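One way to put a number on the skewness that a box plot shows is to compare how far the median sits from each quartile. The sketch below uses a quartile-based skewness coefficient (often attributed to Bowley); this measure is not part of the article itself, and both the helper name `quartile_skewness` and the sample data are made up for illustration. A positive value suggests a longer upper half of the box (right skew), a negative value a longer lower half (left skew), and zero a symmetric box.

```python
import statistics

def quartile_skewness(values):
    """Quartile skewness: ((Q3 - median) - (median - Q1)) / (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Undefined when Q3 == Q1, since the box then has zero width
    return ((q3 - median) - (median - q1)) / (q3 - q1)

right_skewed = [1, 2, 2, 3, 3, 4, 10, 15, 30]  # hypothetical data
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical data

print(quartile_skewness(right_skewed))  # 0.75
print(quartile_skewness(symmetric))     # 0.0
```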


Code

For the code, refer to Module 4 (4.2-Data_Cleaning_Outlier_and_Handling_Text).

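# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Create sample data; 20 lies far from the other values
outlier_data = [1, 3, 5, 7, 9, 11, 20]

df = pd.DataFrame(outlier_data, columns=['Values'])

# By default the whiskers extend at most 1.5 * IQR beyond the box,
# and points past them are drawn individually as outliers
df.boxplot(column=['Values'])
plt.show()

## OR
## Use the seaborn library
# import seaborn as sns
# ax = sns.boxplot(x=outlier_data)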
[Figure: box plot with outliers produced by the Python code]

For more examples, refer here.

© 2022 KeyToDataScience