In descriptive statistics, a box plot or boxplot (also known as a box-and-whisker plot) is a type of chart often used in exploratory data analysis (EDA). Box plots show the distribution and skewness of numerical data by displaying its quartiles (or percentiles) and median.
A box plot is the visual representation of the statistical five-number summary of a given data set: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
- Minimum Score: The lowest score, excluding outliers (shown at the end of the left whisker).
- Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).
- Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.
- Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value.
- Maximum Score: The highest score, excluding outliers (shown at the end of the right whisker).
- Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores).
- The Interquartile Range (or IQR): This is the box itself, which spans the middle 50% of scores (i.e., the range between the 25th and 75th percentiles, Q3 - Q1).
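The quantities listed above can be computed directly. The following is a minimal sketch using only Python's standard library; the helper name `five_number_summary` is hypothetical, and the choice of the "inclusive" quantile method is an assumption (other quartile conventions give slightly different values). Note that `min` and `max` here are the raw extremes of the data, so they may include points that a box plot would draw as outliers.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return min, Q1, median, Q3, max and IQR for a list of numbers.

    Hypothetical helper for illustration; quartiles use the 'inclusive'
    method, which interpolates linearly between data points.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],      # raw minimum (may be an outlier)
        "q1": q1,            # 25% of scores fall below this value
        "median": median,    # mid-point of the data
        "q3": q3,            # 75% of scores fall below this value
        "max": data[-1],     # raw maximum (may be an outlier)
        "iqr": q3 - q1,      # width of the box
    }

summary = five_number_summary([1, 3, 5, 7, 9, 11, 20])
print(summary)
```

For this sample the quartiles come out as Q1 = 4, median = 7 and Q3 = 10, so the box spans an IQR of 6.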
Detect Outliers using Boxplot
Box plots are useful as they show outliers within a data set.
An outlier is an observation that is numerically distant from the rest of the data.
When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.
For example, a common rule flags any point that lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile (that is, below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR).
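The 1.5 * IQR rule can be checked numerically without drawing the plot. The sketch below is illustrative only: the function name `iqr_outliers` and the parameter `k` are hypothetical, and the quartiles again assume the "inclusive" (linear interpolation) method. It also reports where the whiskers would end, namely at the lowest and highest scores that are not outliers.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Apply the k * IQR rule (hypothetical helper, k=1.5 by default).

    Returns the lower fence, the upper fence, the list of outliers,
    and the whisker ends (extreme values that are not outliers).
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr   # Q1 - 1.5 * IQR
    upper_fence = q3 + k * iqr   # Q3 + 1.5 * IQR
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    inliers = [x for x in data if lower_fence <= x <= upper_fence]
    whiskers = (inliers[0], inliers[-1])
    return lower_fence, upper_fence, outliers, whiskers

lower_fence, upper_fence, outliers, whiskers = iqr_outliers([1, 3, 5, 7, 9, 11, 20])
print(lower_fence, upper_fence, outliers, whiskers)
```

With Q1 = 4 and Q3 = 10 the fences are -5 and 19, so 20 is the only value flagged as an outlier and the whiskers end at 1 and 11.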
For code, refer to Module 4 (4.2-Data_Cleaning_Outlier_and_Handling_Text).
    # import library
    import pandas as pd

    # create data
    outlier_data = [1, 3, 5, 7, 9, 11, 20]
    df = pd.DataFrame(outlier_data, columns=['Values'])

    # Box plot uses IQR
    df.boxplot(column=['Values'])

    ## OR ## Use seaborn library
    # import seaborn as sns
    # ax = sns.boxplot(x=outlier_data)
For more examples, refer here.