Univariate Analysis

Univariate analysis explores each variable (feature) in a dataset separately, one at a time. “Uni” means “one” and “variate” means “variable”, so only a single variable is used at a time for statistical analysis and visualization.

Univariate analysis falls under descriptive statistics. Its objective is to describe or summarize data and find patterns in it. The variables it examines can be either categorical or numerical.

Types of Univariate Variables

Before going into the details of univariate analysis, let’s first review the types of data in statistics (see the article below):

Data Types in Statistics

  1. Numerical (or Quantitative) data
    • Discrete data
    • Continuous data
  2. Categorical (or Qualitative) data
    • Nominal data
    • Ordinal data

Numerical

A numerical variable (feature or attribute) is one that may take on any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose,…).

1. Measures of Central Tendency

Min, Max, Mean, Median, Mode

Statistics | Visualization | Equation | Description
Count | Histogram | N | The number of values (observations) of the variable.
Minimum | Box Plot | Min | The smallest value of the variable.
Maximum | Box Plot | Max | The largest value of the variable.
Mean | Box Plot | Σxᵢ / N | The sum of the values divided by the count.
Median | Box Plot | sorted value at position (N + 1) / 2; for even N, the average of the two middle values | The middle value. An equal number of values lies below and above the median.
Mode | Histogram | | The most frequent value. There can be more than one mode.
Table: Measures of Central Tendency
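As a rough illustration of the statistics in the table, the sketch below computes them with Python's standard `statistics` module. The `ages` list is a made-up sample, not data from this article.

```python
import statistics

# Hypothetical sample of ages (illustrative only, not from the article).
ages = [23, 25, 25, 29, 31, 35, 35, 41, 58]

count = len(ages)                    # N: number of observations
minimum = min(ages)                  # smallest value
maximum = max(ages)                  # largest value
mean = statistics.mean(ages)         # sum of the values divided by the count
median = statistics.median(ages)     # middle value of the sorted data
modes = statistics.multimode(ages)   # all most-frequent values (may be several)

print(f"count={count}, min={minimum}, max={maximum}")
print(f"mean={mean:.2f}, median={median}, modes={modes}")
```

`statistics.multimode` is used instead of `statistics.mode` because a variable can have more than one mode, as this sample does.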

Advantages / Disadvantages

Which measure of central tendency should be used, and when?

Type of Variable | Best measure of central tendency
Nominal | Mode
Ordinal | Median
Interval/Ratio (not skewed) | Mean
Interval/Ratio (skewed) | Median
Table: When to use mean, median, and mode
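To see why the median is preferred for skewed interval/ratio data, here is a small sketch with a hypothetical `incomes` list in which one very large value pulls the mean upward while the median stays near the typical value.

```python
import statistics

# Hypothetical monthly incomes; the single large value skews the data to the right.
incomes = [2800, 3000, 3100, 3200, 3400, 3500, 25000]

mean_income = statistics.mean(incomes)      # pulled upward by the extreme value
median_income = statistics.median(incomes)  # stays at the middle of the sorted data

print(f"mean={mean_income:.2f}, median={median_income}")
```

Six of the seven values lie below the mean here, so the median is the more representative summary of a typical income.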

2. Measures of Variability

Range, Quantile, IQR, Variance, Standard Deviation, Coefficient of Variation

Statistics | Visualization | Equation | Description
Range | Box Plot | Max − Min | The difference between the maximum and the minimum.
Quantile | Box Plot | | A set of ‘cut points’ that divide a set of data into groups containing equal numbers of values (quartile, quintile, percentile, …).
Variance | Histogram | σ² | A measure of data dispersion.
Standard Deviation | Histogram | σ = √σ² | The square root of the variance.
Coefficient of Variation | Histogram | σ / mean | The standard deviation divided by the mean.
Table: Measures of Variability

Range

The difference between the maximum and minimum values.

Quantile

Quantiles are ‘cut points’ that divide a set of data into groups containing equal numbers of values, that is, equal-sized subgroups (median, quartiles, percentiles, and more).

  • The only 2-quantile, which divides the set of data into 2 groups, is called the median
  • The 3-quantiles are called tertiles or terciles → T
  • The 4-quantiles are called quartiles → Q. The difference between the upper and lower quartiles is called the interquartile range, mid-spread, or middle fifty → IQR = Q3 − Q1.
  • The 5-quantiles are called quintiles → QU
  • The 6-quantiles are called sextiles → S
  • The 7-quantiles are called septiles
  • The 8-quantiles are called octiles
  • The 10-quantiles are called deciles → D
  • The 100-quantiles are called percentiles → P
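The quantiles listed above can be sketched with `statistics.quantiles`, where the parameter `n` is the number of equal-sized groups, so the function returns `n - 1` cut points. The `data` list is hypothetical, and the default "exclusive" interpolation method is assumed; other methods and libraries can give slightly different cut points.

```python
import statistics

# Hypothetical data: the integers 1 through 11.
data = list(range(1, 12))

median_cut = statistics.quantiles(data, n=2)    # 1 cut point: the median
quartiles = statistics.quantiles(data, n=4)     # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)      # 9 cut points: D1 ... D9
percentiles = statistics.quantiles(data, n=100) # 99 cut points: P1 ... P99

print("median cut point:", median_cut)
print("quartiles:", quartiles)
print("number of deciles:", len(deciles))
print("number of percentiles:", len(percentiles))
```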

IQR (Interquartile Range)

A measure of statistical dispersion and variability based on dividing a data set into quartiles.

Mid-spread or middle fifty → IQR = Q3 − Q1

Figure: Interquartile Range (IQR)
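A minimal sketch of IQR = Q3 − Q1, assuming a hypothetical list of `scores` and the default quartile method of `statistics.quantiles`:

```python
import statistics

# Hypothetical exam scores (illustrative only).
scores = [52, 55, 61, 64, 68, 70, 73, 77, 81, 85, 94]

# quantiles() sorts the data internally and returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1  # spread of the middle fifty percent of the data

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

Because the IQR ignores the lowest and highest quarters of the data, a single extreme score would change the range but leave the IQR largely unaffected.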

Variance

Variance is a measure of data dispersion. In simple words, it measures how far a set of numbers is spread out from its average value. The symbol for variance is σ² (sigma squared). The article on variance explains the difference between population and sample variance using examples, video, and code.
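The difference between population and sample variance can be sketched as follows. The `values` list is hypothetical; `statistics.pvariance` divides the sum of squared deviations by N, while `statistics.variance` divides by N − 1.

```python
import statistics

# Hypothetical data with mean 5 (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
squared_deviations = [(x - mean) ** 2 for x in values]

# Population variance: average squared deviation from the mean (divide by N).
population_variance = sum(squared_deviations) / len(values)

# Sample variance: divide by N - 1 instead of N.
sample_variance = sum(squared_deviations) / (len(values) - 1)

# The standard library gives the same results.
assert population_variance == statistics.pvariance(values)
assert abs(sample_variance - statistics.variance(values)) < 1e-12

print(f"population variance={population_variance}, sample variance={sample_variance:.3f}")
```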

Standard Deviation

Standard deviation measures the dispersion of data around its mean. It is calculated as the square root of the variance. The symbol for standard deviation is σ (the Greek letter sigma). The article on standard deviation covers all aspects using examples, video, and Python code.
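As a quick check of the square-root relationship, this sketch compares `statistics.pstdev` and `statistics.stdev` with the square roots of the corresponding variances, using a hypothetical `values` list.

```python
import math
import statistics

# Hypothetical data (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = statistics.pstdev(values)  # square root of the population variance
sample_sd = statistics.stdev(values)       # square root of the sample variance

# Both agree with taking the square root of the variance directly.
assert math.isclose(population_sd, math.sqrt(statistics.pvariance(values)))
assert math.isclose(sample_sd, math.sqrt(statistics.variance(values)))

print(f"population SD={population_sd}, sample SD={sample_sd:.3f}")
```

Unlike the variance, the standard deviation is expressed in the same units as the data, which makes it easier to interpret.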

Coefficient of Variation

The coefficient of variation (CV) represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.
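The following sketch shows how the CV allows two series with very different means to be compared. The helper `coefficient_of_variation` and both data series are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean (hypothetical helper)."""
    return statistics.stdev(values) / statistics.mean(values)

# Two hypothetical series whose means differ by a factor of about 100.
series_a = [10, 12, 11, 9, 13]
series_b = [1000, 1010, 990, 1005, 995]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)

print(f"SD a={statistics.stdev(series_a):.2f}, CV a={cv_a:.2%}")
print(f"SD b={statistics.stdev(series_b):.2f}, CV b={cv_b:.2%}")
```

Series b has the larger standard deviation in absolute terms, but relative to its mean it varies far less than series a, which is what the CV captures. The CV is only meaningful when the mean is not zero.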

3. Curve Types

Statistics | Visualization | Equation | Description
Skewness | Histogram | | A measure of symmetry or asymmetry in the distribution of data.
Kurtosis | Histogram | | A measure of whether the data are peaked or flat relative to a normal distribution.
Table: Skewness and Kurtosis

Skewness

Skewness measures the degree of asymmetry in the distribution of a data set, that is, how far it deviates from a symmetrical shape such as the normal distribution (bell curve). A distribution can be symmetrical (zero skew), right-skewed (positive skew), or left-skewed (negative skew). The article on skewness explains it using examples, video, and Python code.
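One common way to quantify skewness is the moment-based coefficient, the third central moment divided by the 1.5 power of the second. The sketch below assumes that definition; the `skewness` function and the three sample lists are illustrative only.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments.

    Assumes the values are not all identical (otherwise m2 is zero).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]     # long tail toward large values
left_skewed = [-10, -3, -2, -2, -1, -1]  # mirror image: long tail toward small values

print(f"symmetric: {skewness(symmetric):.3f}")
print(f"right-skewed: {skewness(right_skewed):.3f}")
print(f"left-skewed: {skewness(left_skewed):.3f}")
```

The sign of the result indicates the direction of the longer tail: positive for right-skewed data and negative for left-skewed data.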

Kurtosis

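Kurtosis describes the shape of a distribution's peak and tails relative to a normal distribution. There are 3 types of kurtosis: mesokurtic (similar to a normal distribution), leptokurtic (sharper peak and heavier tails), and platykurtic (flatter peak and lighter tails). The article on kurtosis explains these types using examples, video, visualization, and Python code.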
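A minimal sketch of the three types, assuming the moment-based definition of excess kurtosis (fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores about 0). The function names, the 0.5 tolerance, and the sample lists are hypothetical choices for illustration.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

def kurtosis_type(values, tolerance=0.5):
    """Label a sample; the tolerance around 0 is an arbitrary illustrative cutoff."""
    k = excess_kurtosis(values)
    if k > tolerance:
        return "leptokurtic"
    if k < -tolerance:
        return "platykurtic"
    return "mesokurtic"

flat = [1, 2, 3, 4, 5]                        # evenly spread values
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 1, 10]  # concentrated center, extreme tails

print(kurtosis_type(flat), round(excess_kurtosis(flat), 3))
print(kurtosis_type(heavy_tailed), round(excess_kurtosis(heavy_tailed), 3))
```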

4. Numerical Visualization

Histogram

A histogram shows the underlying frequency distribution (shape) of a set of continuous data. Using a histogram, we can understand features of the data such as its distribution (normal or skewed), outliers, etc.

Histograms and bar plots look similar because both use bars to show counts. However, a bar plot shows the count of each category of a categorical variable, whereas a histogram is generally used on continuous data, grouping it into bins that indicate the number of data points in each range.

Figure: Histogram with 'Frequency' on the y-axis and 'Age' on the x-axis
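The binning behind a histogram can be sketched in plain Python. The `histogram_counts` helper, the `ages` sample, and the bin width of 10 are all hypothetical choices; a plotting library would normally do this step and draw the bars.

```python
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [lo, lo + bin_width).

    Returns a dict mapping each bin's lower edge to its frequency.
    Bins that contain no values are omitted.
    """
    counts = Counter()
    for v in values:
        lo = start + ((v - start) // bin_width) * bin_width
        counts[lo] += 1
    return dict(sorted(counts.items()))

# Hypothetical ages (illustrative only).
ages = [21, 22, 25, 27, 31, 33, 34, 35, 38, 42, 45, 51, 58, 63]
bin_width = 10
counts = histogram_counts(ages, bin_width, start=20)

# Text rendering: one row per bin, one '#' per observation.
for lo, n in counts.items():
    print(f"{lo}-{lo + bin_width - 1}: {'#' * n}")
```

Changing the bin width changes the apparent shape of the distribution, so it is worth trying more than one value.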

Boxplot

A boxplot (or box-and-whisker plot) visually shows the distribution of numerical data. It is the visual representation of the statistical five-number summary: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum.
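The five values a boxplot draws can be computed directly. This sketch uses a hypothetical `five_number_summary` helper and sample data, and assumes the "inclusive" quartile method of `statistics.quantiles`; plotting libraries may use a slightly different quartile convention.

```python
import statistics

def five_number_summary(values):
    """Return the minimum, Q1, median, Q3, and maximum of the data."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical measurements (illustrative only).
measurements = [2, 4, 5, 7, 8, 9, 11, 12, 15]
summary = five_number_summary(measurements)

for name, value in summary.items():
    print(f"{name}: {value}")
```

In the plot, the box spans Q1 to Q3, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum.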

Categorical

1. Frequency Table

A frequency table counts the occurrences of each category of the variable. In addition to the count, we can add the percentage of observations that fall into each category, giving both Count and Count%.

For example, for a class of students we can see the count and percentage of boys and girls.

Students | Count | Count%
Boys | 26 | 52%
Girls | 24 | 48%
Table: Frequency Table
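A frequency table like this one can be built with `collections.Counter`. The sketch below uses the counts from the example (26 boys and 24 girls); the list construction and variable names are illustrative.

```python
from collections import Counter

# Rebuild the class from the example counts: 26 boys and 24 girls.
students = ["Boys"] * 26 + ["Girls"] * 24

counts = Counter(students)    # occurrences of each category
total = sum(counts.values())  # total number of students

# Map each category to (Count, Count%).
frequency_table = {
    category: (n, 100 * n / total) for category, n in counts.items()
}

print("Students | Count | Count%")
for category, (n, pct) in frequency_table.items():
    print(f"{category} | {n} | {pct:.0f}%")
```

The percentages of all categories should add up to 100, which is a useful sanity check on any frequency table.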

2. Categorical Visualization

Pie Chart

A pie chart is a circular graph in which each “pie slice” denotes the relative size of one category. Pie charts are mainly used to understand how a group is divided into smaller pieces. The whole pie represents 100 percent.

Figure: Pie Chart

Bar Plot

The bar plot (or bar graph or bar chart) is well suited to comparing categories or different groups of data. It can also help track and compare changes over time. It is mostly used for visualizing discrete data.

Figure: Bar Chart


© 2022 KeyToDataScience