Bivariate Analysis

Bivariate analysis is the simultaneous analysis of two variables. The two variables may be dependent on or independent of each other. Bivariate analysis explores the relationship between them: whether an association exists and how strong it is.

Let’s see an example of bivariate analysis. In a classroom, a teacher analyzes the proportion of students who scored above 75%, broken down by gender. In this case, there are two variables: gender and result.

For a data analyst or data scientist trying to uncover the story in the data, knowing which analysis to use, and when, makes it much faster to navigate the data.

Depending on the types of the variables, there are three types of bivariate analysis:

  1. Numerical & Numerical
  2. Categorical & Categorical
  3. Numerical & Categorical

Let’s see how to choose the right analysis technique for each type.

1. Numerical & Numerical

Causality

A relationship between two events in which one event is affected by the other. An association between two variables does not by itself establish causality.

Covariance

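A quantitative measure of the joint variability of two random variables. A positive value means the variables tend to move in the same direction; a negative value means they tend to move in opposite directions.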
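
To make the definition concrete, here is a minimal sketch that computes the sample covariance by hand and compares it with NumPy's `np.cov`. The temperature and sales figures are invented for illustration and do not come from any real data set.

```python
import numpy as np

# Hypothetical data, invented for illustration only
temperature = np.array([20.0, 22.0, 25.0, 27.0, 30.0])
sales = np.array([70.0, 74.0, 81.0, 83.0, 92.0])

# Sample covariance: sum of the products of the deviations from each mean,
# divided by n - 1
dev_t = temperature - temperature.mean()
dev_s = sales - sales.mean()
cov_manual = (dev_t * dev_s).sum() / (len(temperature) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry
# is the covariance between the two variables
cov_numpy = np.cov(temperature, sales)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree and are positive here, because higher temperatures go together with higher sales in this made-up sample. The magnitude depends on the units of the two variables, which is why covariance is hard to compare across data sets.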

Correlation

Measures the strength of the linear relationship between two numerical variables. It ranges from -1 to 1 and is a normalized version of covariance.

Correlation analysis quantifies the strength of association between two continuous variables, for example between a dependent and an independent variable, or between two independent variables. Its value is always between -1 and 1.

  • Correlation = 0: When there is no correlation between two variables, there is no tendency for the values of the first variable to increase or decrease with the values of the second variable.
  • Correlation > 0: When there is a positive correlation between two variables, the second variable tends to increase when the first variable increases, and tends to decrease when the first variable decreases. +1 means perfect positive linear correlation.
  • Correlation < 0: When there is a negative correlation between two variables, the second variable tends to decrease when the first variable increases, and vice versa. -1 means perfect negative linear correlation.
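
The sign of the correlation can be illustrated with a short sketch. It assumes synthetic data generated with NumPy, and `pearson` is a hypothetical helper written for this example, not a library function. It computes the Pearson correlation as covariance divided by the product of the two standard deviations, which is what "normalized version of covariance" means.

```python
import numpy as np

# Synthetic data, generated only for illustration
rng = np.random.default_rng(0)
x = np.arange(20, 50, dtype=float)
y_pos = 2.0 * x + rng.normal(0, 5, size=x.size)   # tends to rise with x
y_neg = -2.0 * x + rng.normal(0, 5, size=x.size)  # tends to fall as x rises


def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return np.cov(a, b)[0, 1] / (a.std(ddof=1) * b.std(ddof=1))


r_pos = pearson(x, y_pos)
r_neg = pearson(x, y_neg)

# np.corrcoef computes the same quantity directly
print(r_pos, np.corrcoef(x, y_pos)[0, 1])
print(r_neg, np.corrcoef(x, y_neg)[0, 1])
```

The first coefficient is close to +1 and the second close to -1. Neither reaches the extreme exactly, because of the added noise.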

Scatter Plot

A scatter plot is a useful visual representation of the relationship between two numerical variables (attributes) and is usually drawn before working out a linear correlation or fitting a regression line. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between the two variables. More information can be added to a two-dimensional scatter plot; for example, we might label points with a code to indicate the level of a third variable. If we are dealing with many variables in a data set, a scatter plot matrix presents all possible scatter plots of two variables at a time.

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Create a dataset:
df=pd.DataFrame({'X': range(20,50), 'y': np.random.randn(30)*5+range(70,100) })

# plot
plt.scatter(x=df['X'], y=df['y'])
plt.title("Ice Cream Sales Data")
plt.ylabel("Ice cream sales")
plt.xlabel("Temperature (°C)")
plt.show()

Figure: Scatter Plot
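
For the scatter plot matrix mentioned above, one possible sketch uses `scatter_matrix` from `pandas.plotting`. The three columns (`temperature`, `sales`, `visitors`) are synthetic and invented for this example.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data set with three numerical variables (illustration only)
rng = np.random.default_rng(1)
temperature = rng.uniform(20, 40, size=50)
df_multi = pd.DataFrame({
    "temperature": temperature,
    "sales": 2.5 * temperature + rng.normal(0, 5, size=50),
    "visitors": rng.integers(50, 200, size=50),
})

# One scatter plot per pair of variables, with histograms on the diagonal
axes = scatter_matrix(df_multi, figsize=(6, 6), diagonal="hist")
```

In a script, call `plt.show()` from `matplotlib.pyplot` afterwards to display the figure. The result is a 3 x 3 grid of panels, one for each pair of columns.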

2. Categorical & Categorical

Chi-square Test

The chi-square test (symbolically represented as χ²) is used for determining the association between two categorical variables. It is calculated from the differences between the expected frequencies and the observed frequencies in a frequency (contingency) table.

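  • Small p-value (conventionally below 0.05): the observed frequencies differ from those expected under independence by more than chance would explain, which suggests an association between the two categorical variables.
  • Large p-value: the observed frequencies are consistent with the two categorical variables being independent.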
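
As a sketch of how the test can be run in Python, the example below uses `chi2_contingency` from SciPy on the gender and result example from the introduction. The counts are hypothetical and invented for illustration. The returned p-value is the probability of seeing a statistic at least this large if the two variables were independent.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (counts invented for illustration)
# columns: scored above 75%, scored 75% or below
observed = np.array([
    [18, 12],  # female students
    [14, 16],  # male students
])

# Returns the test statistic, the p-value, the degrees of freedom and
# the frequencies expected if gender and result were independent.
# For a 2x2 table SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi-square statistic:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected frequencies:")
print(expected)
```

The expected frequencies are derived from the row and column totals alone, so comparing them with `observed` shows where the table departs from independence.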

Stacked Column Chart

A stacked column chart helps to visualize the relationship between two categorical variables. It shows how much each category of the first variable contributes to the total within each category of the second variable. Let’s understand this using a Python code example.

import seaborn as sns
from matplotlib import pyplot as plt

# load sample data set from seaborn
tips = sns.load_dataset('tips')

# total tips per day, split by gender (one column per gender)
tips_nw = tips.groupby(['day', 'sex'])['tip'].sum().unstack().fillna(0)

# plot a stacked column chart and rotate the x-axis labels to horizontal
tips_nw.plot(kind='bar', stacked=True)
plt.title('Tips by Day and Gender')
plt.xticks(rotation=0, ha='center')
plt.show()

Figure: Stacked Column Chart
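
To compare percentages rather than totals, one option is a 100% stacked column chart built from `pd.crosstab` with `normalize="index"`. This is a minimal sketch; the `records` table of student genders and results is hypothetical.

```python
import pandas as pd

# Hypothetical student records (invented for illustration)
records = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "result": ["above 75%", "above 75%", "75% or below", "above 75%",
               "75% or below", "75% or below", "above 75%", "above 75%"],
})

# Cross-tabulate the two categorical variables and convert each row
# (gender) to percentages, so every column in the chart adds up to 100
pct = pd.crosstab(records["gender"], records["result"], normalize="index") * 100
print(pct)

# Stacked column chart of the percentages, with horizontal x-axis labels
ax = pct.plot(kind="bar", stacked=True, rot=0)
ax.set_ylabel("Share of students (%)")
ax.set_title("Result by gender (hypothetical data)")
```

Because every column has the same height, differences between the groups show up only in how each column is split.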

3. Numerical & Categorical

The Z-test and the t-test both assess whether a difference in means (between a sample and a population, or between two groups) is larger than would be expected by chance. The two tests are closely related. As a rule of thumb, the Z-test is used when the sample size is large (greater than 30) or the population variance is known, and the t-test is used when the sample size is small (less than 30) and the population variance is unknown.

Z-test

The Z-test checks whether the averages of two groups are statistically different from each other. The count of values in each group should be greater than 30. The smaller the p-value of the Z-test, the stronger the evidence that the averages of the two groups differ.
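
A two-sample Z-test can be sketched with NumPy and `scipy.stats.norm`. The function `two_sample_z_test` is a hypothetical helper written for this example, and the two groups are synthetic. It uses the sample variances in place of the population variances, which is a large-sample approximation.

```python
import numpy as np
from scipy.stats import norm

# Synthetic scores for two large groups (illustration only)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=72, scale=10, size=200)
group_b = rng.normal(loc=75, scale=10, size=200)


def two_sample_z_test(a, b):
    """Two-sided Z-test for a difference in means (large-sample approximation)."""
    # Standard error of the difference between the two sample means
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    # Two-sided p-value from the standard normal distribution
    p = 2 * norm.sf(abs(z))
    return z, p


z_stat, p_z = two_sample_z_test(group_a, group_b)
print("z statistic:", z_stat)
print("p-value:", p_z)
```

A p-value below the chosen significance level (often 0.05) would be read as evidence that the two group averages differ.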

T-test

The t-test checks whether the averages of two groups are statistically different from each other. It is used when the count of values in each group is less than 30.
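
For small samples, SciPy provides `ttest_ind` for two independent groups. The sketch below uses hypothetical exam scores. Passing `equal_var=False` selects Welch's t-test, which does not assume that the two groups have the same variance. That is a choice made for this example, not something the text above prescribes.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical exam scores for two small groups (invented for illustration)
scores_a = np.array([62, 70, 68, 75, 80, 66, 71, 73, 69, 77], dtype=float)
scores_b = np.array([78, 82, 74, 88, 79, 85, 81, 76, 90, 83], dtype=float)

# Welch's t-test: compares the two means without assuming equal variances
t_stat, p_t = ttest_ind(scores_a, scores_b, equal_var=False)

print("t statistic:", t_stat)
print("p-value:", p_t)
```

The statistic is negative here because the first group's average is lower than the second's. The p-value indicates how likely a gap of this size would be if the two groups had the same true mean.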

ANOVA

The ANOVA (analysis of variance) test is used to determine whether there is a statistically significant difference among the averages of more than two groups. It is appropriate for comparing the averages of a numerical variable across more than two categories of a categorical variable.
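
A one-way ANOVA can be sketched with `f_oneway` from SciPy. The tip amounts for the three days below are hypothetical, invented for illustration. The day is the categorical variable and the tip amount is the numerical one.

```python
from scipy.stats import f_oneway

# Hypothetical tip amounts for three days (invented for illustration)
tips_thur = [2.0, 2.5, 3.0, 2.2, 2.8, 3.1]
tips_fri = [2.9, 3.5, 3.2, 2.7, 3.8, 3.0]
tips_sat = [3.5, 4.0, 2.9, 4.4, 3.7, 3.9]

# One-way ANOVA: the F statistic is the ratio of the variation between
# group means to the variation within the groups
f_stat, p_anova = f_oneway(tips_thur, tips_fri, tips_sat)

print("F statistic:", f_stat)
print("p-value:", p_anova)
```

A small p-value suggests that at least one group average differs from the others. It does not say which one, so a follow-up comparison between pairs of groups is usually needed.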

Bar Chart

The bar plot (or bar graph or bar chart) is well suited to comparing categories or groups of data. It can also help track and compare changes over time, and is mostly used for visualizing discrete data.

# import libraries
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("dark")

# create data set
students = [17, 15, 21, 11]
courses = ['Data Analyst', 'Business Analyst', 'Data Scientist', 'Data Engineer']

# plot bar chart
sns.barplot(x=courses, y=students)
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("KeytoDataScience - Students enrolled in different courses")
plt.show()

Figure: Bar Chart

Line Chart

A line chart (or line plot or line graph) displays information as a series of data points connected by straight line segments.

# import libraries
import matplotlib.pyplot as plt
import numpy as np

# create data set
line = np.cumsum(np.random.randn(20,1))

# plot line chart
plt.plot(line)
plt.show()

Figure: Line plot

© 2022 KeyToDataScience