Univariate analysis explores each variable (feature) one at a time. “Uni” means “one” and “variate” means “variable”, so only a single variable is used at a time for statistical analysis and visualization.
Univariate analysis falls under descriptive statistics. Its objective is to describe and summarize the data and to find patterns in it. Each variable in the dataset is explored separately, and a variable can be either categorical or numerical.
Before going into the details of univariate analysis, let’s first review the types of data in statistics.
A numerical variable (feature or attribute) is one that may take on any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose,…).
1. Measures of Central Tendency
Min, Max, Mean, Median, Mode
|Statistic|Visualization|Equation|Description|
|---|---|---|---|
|Count|Histogram|N|The number of values (observations) of the variable.|
|Minimum|Box Plot|Min|The smallest value of the variable.|
|Maximum|Box Plot|Max|The largest value of the variable.|
|Mean|Box Plot| |The sum of the values divided by the count.|
|Median|Box Plot| |The middle value. An equal number of values lie below and above the median.|
|Mode|Histogram| |The most frequent value. There can be more than one mode.|
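As a minimal sketch, the statistics in the table above can be computed with Python's standard `statistics` module. The `scores` list is hypothetical data made up for illustration.

```python
import statistics

# Hypothetical sample: exam scores for a small class.
scores = [55, 60, 60, 65, 70, 70, 70, 80, 95]

summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": statistics.mean(scores),      # sum of the values divided by the count
    "median": statistics.median(scores),  # middle value of the sorted data
    # multimode returns every most-frequent value, because there can be more than one mode
    "modes": statistics.multimode(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

`statistics.multimode` is used instead of `statistics.mode` so that ties are reported as a list rather than hidden.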
Which measure of central tendency should be used, and when?
|Type of Variable|Best measure of central tendency|
|---|---|
|Interval/Ratio (not skewed)|Mean|
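The table recommends the mean when the data is not skewed. The sketch below shows why skew matters: one extreme value pulls the mean a long way but leaves the median unchanged. Both lists are hypothetical.

```python
import statistics

# Hypothetical values; the second list replaces the largest value with an extreme one.
symmetric = [30, 32, 34, 36, 38]
skewed = [30, 32, 34, 36, 200]

for label, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(label, "mean =", statistics.mean(data), "median =", statistics.median(data))
```

For the symmetric list the mean and median agree, so either describes the centre well. For the skewed list the mean is far above most of the values, while the median stays where it was.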
2. Measure of Variability
Range, Quantile, IQR, Variance, Standard Deviation, Coefficient of Variation
|Statistic|Visualization|Equation|Description|
|---|---|---|---|
|Range|Box Plot|Max − Min|The difference between maximum and minimum.|
|Quantile|Box Plot| |A set of ‘cut points’ that divide a set of data into groups containing equal numbers of values (Quartile, Quintile, Percentile, …).|
|Variance|Histogram| |A measure of data dispersion.|
|Standard Deviation|Histogram| |The square root of the variance.|
|Coefficient of Variation|Histogram| |The standard deviation divided by the mean.|
Range
The difference between the maximum and minimum values.
Quantile
A set of ‘cut points’ that divide a set of data into groups containing equal numbers of values, or equal-sized subgroups (Median, Quartile, Percentile, and more).
- The only 2-quantile, which divides the set of data into 2 groups, is called the median
- The 3-quantiles are called tertiles or terciles → T
- The 4-quantiles are called quartiles → Q. The difference between upper and lower quartiles is also called the interquartile range, mid-spread or middle fifty → IQR = Q3 − Q1.
- The 5-quantiles are called quintiles → QU
- The 6-quantiles are called sextiles → S
- The 7-quantiles are called septiles
- The 8-quantiles are called octiles
- The 10-quantiles are called deciles → D
- The 100-quantiles are called percentiles → P
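The list above can be tried out with `statistics.quantiles`, which returns the n − 1 cut points that split the data into n groups. This is only a sketch on hypothetical data (the integers 1 to 100), and it uses the function's default `"exclusive"` method. Other methods and other libraries can give slightly different cut points.

```python
import statistics

data = list(range(1, 101))  # hypothetical data: the integers 1..100

# n groups are separated by n - 1 cut points.
median_cut = statistics.quantiles(data, n=2)   # 1 cut point: the median
quartiles = statistics.quantiles(data, n=4)    # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)     # 9 cut points: D1..D9

print("median cut:", median_cut)
print("quartiles:", quartiles)
print("number of deciles:", len(deciles))
```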
IQR (Interquartile Range)
A measure of statistical dispersion and variability based on dividing a data set into quartiles.
Mid-spread or middle fifty → IQR = Q3 − Q1
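A short sketch of IQR = Q3 − Q1, compared with the range on hypothetical data. The helper name `iqr` is made up for this example, and the quartiles come from the default method of `statistics.quantiles`.

```python
import statistics

def iqr(values):
    """Return Q3 - Q1 using the default 'exclusive' method of statistics.quantiles."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical data: the second list replaces the largest value with an extreme one.
clean = [10, 12, 14, 16, 18, 20, 22]
with_outlier = [10, 12, 14, 16, 18, 20, 220]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(label, "range =", max(data) - min(data), "IQR =", iqr(data))
```

In this example the extreme value changes the range a lot, while the IQR stays the same because it only looks at the middle fifty percent of the data.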
Variance is a measure of data dispersion. In simple words, it measures how far a set of numbers is spread out from their average value. It is computed slightly differently for a population and for a sample. The symbol for (population) variance is σ² (sigma squared).
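As a hedged illustration of population versus sample variance: the population version divides the summed squared deviations by N, and the sample version divides by N − 1. The `data` list is hypothetical, and the manual calculation is there only to show what the library call does.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

# Manual population variance: average squared distance from the mean.
mean = sum(data) / len(data)
manual_pop_var = sum((x - mean) ** 2 for x in data) / len(data)

pop_var = statistics.pvariance(data)    # divides by N
sample_var = statistics.variance(data)  # divides by N - 1
pop_std = statistics.pstdev(data)       # square root of the population variance

print("population variance:", pop_var)
print("sample variance:", sample_var)
print("population standard deviation:", pop_std)
```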
Coefficient of Variation
The coefficient of variation (CV) represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.
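A minimal sketch of comparing two series with very different means. The function name and the data are hypothetical. Using the sample standard deviation here is an assumption, since the text does not say which version is meant.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (the mean must be non-zero)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical measurements in different units and with different means.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

print("CV of heights:", round(coefficient_of_variation(heights_cm), 4))
print("CV of weights:", round(coefficient_of_variation(weights_kg), 4))
```

Because the CV is a ratio, it has no unit, which is what makes the two series comparable.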
3. Curve Types
|Statistic|Visualization|Description|
|---|---|---|
|Skewness|Histogram|A measure of symmetry or asymmetry in the distribution of data.|
|Kurtosis|Histogram|A measure of whether the data are peaked or flat relative to a normal distribution.|
Skewness is the degree of asymmetry observed in a probability distribution, that is, how far it deviates from the symmetrical normal distribution (bell curve) for a given set of data. A distribution can be symmetrical, right skewed, or left skewed.
Kurtosis describes the shape of a distribution, in particular how peaked or flat it is relative to a normal distribution. There are 3 types of kurtosis: mesokurtic, leptokurtic, and platykurtic.
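One common way to put numbers on these ideas is with moment-based formulas, sketched below in plain Python. The function names and data are hypothetical, and these are the simple population formulas. Statistical libraries may apply sample corrections, so their results can differ slightly.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """About 0 for symmetric data, > 0 for right skew, < 0 for left skew."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Kurtosis minus 3, so a normal distribution scores about 0 (mesokurtic).
    Positive values are leptokurtic, negative values are platykurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
left_skewed = [1, 8, 9, 9, 10, 10]

for label, data in (("symmetric", symmetric), ("right", right_skewed), ("left", left_skewed)):
    print(label, "skewness =", round(skewness(data), 3),
          "excess kurtosis =", round(excess_kurtosis(data), 3))
```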
4. Numerical Visualization
A histogram shows the underlying frequency distribution (shape) of a set of continuous data. Using a histogram, we can understand features of the data such as its distribution (normal or skewed), outliers, etc.
Histograms and bar plots look similar because both use bars whose heights represent counts. However, a bar plot shows one bar per category, whereas a histogram is generally used on continuous data, which it groups into bins; each bar indicates the number of data points that fall in that range.
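To make the binning idea concrete, here is a small sketch that counts how many values fall in each bin and prints a rough text histogram. The function name, the data, and the bin width are all hypothetical. In practice a plotting library would do the binning and drawing.

```python
def histogram_counts(values, bin_width, start):
    """Count values per bin [left, left + bin_width), keyed by the bin's left edge.
    Bins that contain no values are simply left out."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        left = start + index * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

heights = [151, 153, 158, 160, 161, 164, 165, 167, 172, 178]  # hypothetical heights in cm
bins = histogram_counts(heights, bin_width=10, start=150)

for left, count in bins.items():
    print(f"{left}-{left + 10}: {'#' * count}")
```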
A box plot (or box-and-whisker plot) visually shows the distribution of numerical data. It is the visual representation of the statistical five-number summary: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum.
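The five numbers behind a box plot can be computed directly, as in this sketch. The function name and the `scores` data are hypothetical, and the quartiles use the default method of `statistics.quantiles`, so a plotting library may place them slightly differently.

```python
import statistics

def five_number_summary(values):
    """Return the minimum, Q1, median, Q3 and maximum of the values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

scores = [52, 58, 61, 64, 67, 70, 73, 77, 81, 85, 98]  # hypothetical scores
summary = five_number_summary(scores)

for name, value in summary.items():
    print(f"{name}: {value}")
```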
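A categorical variable (feature or attribute) is one that takes one of a limited number of categories or labels.
1. Frequency Table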
A frequency table counts the occurrences of each category of the variable. In addition to the count, we can add the percentage of observations that fall into each category, giving both Count and Count%.
For example, for a class of students we can find the count and percentage of boys and girls.
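The boys and girls example can be sketched with `collections.Counter`. The `students` list is hypothetical data.

```python
from collections import Counter

# Hypothetical class: one label per student.
students = ["boy", "girl", "girl", "boy", "girl", "girl", "boy", "girl"]

counts = Counter(students)
total = sum(counts.values())

# Map each category to (Count, Count%), most frequent category first.
table = {
    category: (count, 100 * count / total)
    for category, count in counts.most_common()
}

print(f"{'class':<6}{'count':>6}{'count%':>9}")
for category, (count, percent) in table.items():
    print(f"{category:<6}{count:>6}{percent:>8.1f}%")
```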
2. Categorical Visualization
A pie chart is a circular graph in which each “pie slice” denotes the relative size of a particular category. Pie charts are mainly used to understand how a group divides into smaller pieces. The whole pie represents 100 percent.
The bar plot (or bar graph or bar chart) is well suited to comparing categories or groups of data. It can also help track and compare changes over time. It is mostly used for visualizing discrete data.
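A bar plot would normally be drawn with a plotting library. The sketch below only illustrates the idea that each bar's length is proportional to its category's count, using a plain-text rendering. The function name and the `fruit_counts` data are hypothetical.

```python
def text_bar_chart(counts, width=20):
    """Return one text line per category; the largest count gets a bar of `width` characters."""
    longest = max(counts.values())
    lines = []
    for category, count in counts.items():
        bar = "#" * round(width * count / longest)
        lines.append(f"{category:<8}{bar} {count}")
    return lines

fruit_counts = {"apple": 10, "banana": 5, "cherry": 2}  # hypothetical category counts
chart = text_bar_chart(fruit_counts)

for line in chart:
    print(line)
```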