Logistic Regression is a supervised learning algorithm which is used when the dependent variable (or target) is categorical.
The dependent variable should be categorical or finite:
- either A or B (binary logistic regression)
- or, a range of finite options A, B, C or D (multinomial or ordinal logistic regression).
Table of Contents
- Overview
- Types of Logistic Regression
- How Logistic Regression Works
- Assumptions of Logistic Regression
- Video
- Model Evaluation
- Python Code
1. Overview
The logistic regression algorithm assumes a relationship between the dependent variable (y) and one or more independent variables (x), and predicts the output probability using the logistic regression equation.
Using logistic regression we can predict the likelihood of an event happening. For example,
- We can estimate the likelihood that a visitor will shop on a website (a binary dependent variable). The independent (input) variables can be known characteristics of the visitor, such as the number of repeat visits to the site, the number of clicks, etc. A logistic regression model can help determine the probability that a visitor converts into a customer.
- We can predict whether an email is spam (1) or not (0). A logistic regression model predicts a probability between 0 and 1 that the email is spam: a probability closer to 0 means not spam, and a probability closer to 1 means spam. We can then classify the email using a threshold value. Say, if the probability is above 0.5, the email is classified as spam, otherwise as not spam.
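The thresholding step in the spam example can be sketched in a few lines of Python. This is a minimal illustration only: the function name `classify_email` and the probability values are made up for the example and do not come from a fitted model.

```python
def classify_email(spam_probability, threshold=0.5):
    """Return 1 (spam) if the probability is above the threshold, else 0 (not spam)."""
    return 1 if spam_probability > threshold else 0

# Hypothetical probabilities, standing in for the output of a fitted model
probabilities = [0.05, 0.42, 0.58, 0.97]
labels = [classify_email(p) for p in probabilities]
print(labels)  # [0, 0, 1, 1]
```

The threshold does not have to be 0.5. Lowering it flags more emails as spam, and raising it flags fewer.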
2. Types of Logistic Regression
On the basis of the dependent variable output categories, Logistic Regression can be classified into three types:
- Binary or Binomial: In Binary Logistic regression, a dependent variable will have only two possible output types such as 1 or 0, pass or fail, yes or no, win or loss, etc.
- Multinomial: In multinomial Logistic regression, a dependent variable can have 3 or more possible types of unordered output (having no quantitative significance). For example in case of iris flower classification the output is setosa, virginica or versicolor.
- Ordinal: In ordinal Logistic regression, a dependent variable can have 3 or more possible types of ordered output (having a quantitative significance). For example the output of dependent variables can be 3 or more possible ordered types such as,
- low, medium, or high
- poor, good, very good, excellent
- scores like 1, 2, 3, 4 or 5
3. How Logistic Regression Works
The logistic regression model uses the sigmoid (logistic) function to squeeze the output of a linear equation between 0 and 1. Let’s first learn more about sigmoid function.
Sigmoid Function
- The sigmoid function is a mathematical function that maps any real value to a value between 0 and 1, which can be interpreted as a probability.
- The probability output by a classification model must lie between 0 and 1, so the sigmoid function converts any real value into a point on an S-shaped curve. This S-shaped curve is known as the sigmoid function or the logistic function.
- In logistic regression, we use a threshold value to convert the probability into a decision. For example, in binary logistic regression the output should be 0 or 1. Suppose the threshold is 0.5: values above the threshold are converted to 1, and values below it are converted to 0, as in the spam example discussed earlier.
The sigmoid function is y = 1 / (1 + e^(−t)). As t increases towards infinity, the predicted (output) value y tends to 1, and as t decreases towards negative infinity, y tends to 0.
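The behaviour of the sigmoid at both extremes is easy to check numerically. The sketch below assumes the standard definition 1 / (1 + e^(−t)). The function name `sigmoid` is chosen for this example, and the branch for negative inputs is only there to avoid floating-point overflow.

```python
import math

def sigmoid(t):
    """Map any real number t to a value between 0 and 1."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t, use the equivalent form e^t / (1 + e^t),
    # which avoids overflow when computing exp(-t) for very negative t
    e = math.exp(t)
    return e / (1.0 + e)

for t in (-10, -1, 0, 1, 10):
    print(t, round(sigmoid(t), 4))
```

The printed values rise from close to 0, through exactly 0.5 at t = 0, to close to 1, which is the S-shaped curve described above.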
Logistic Regression Explanation
The logistic regression model uses this sigmoid function to squeeze the output of a linear regression equation between 0 and 1.
Let’s first put down the equation of linear regression algorithm. We have modeled the relationship between hypothesis function (predicted output) and input features (independent variables) using a linear equation as follows:
hθ(x) = θ0 + θ1x1 + θ2x2 + … + θpxp
For classification, we prefer probabilities between 0 and 1. So let's pass the right-hand side of the above equation for hθ(x) into the sigmoid function as its input t. This forces the output of the linear regression equation to take values between 0 and 1 only.
The hypothesis function for logistic regression is:
hθ(x) = 1 / (1 + e^(−(θ0 + θ1x1 + θ2x2 + … + θpxp)))
We cannot use linear regression's sum of squared errors cost here. Because the logistic regression hypothesis function is nonlinear, the squared-error cost becomes a non-convex function of θ with local minima. Logistic regression instead uses the log-loss cost, which is a convex function with a single global minimum.
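To make the hypothesis function concrete, here is a minimal sketch that applies the sigmoid to a linear combination of the inputs. The function name `hypothesis`, the coefficient values, and the feature values are all made up for illustration; in practice the coefficients are learned from data.

```python
import math

def hypothesis(theta, x):
    """Logistic regression hypothesis for a single example.

    theta: [theta_0, theta_1, ..., theta_p], where theta_0 is the intercept
    x:     [x_1, ..., x_p], the feature values of the example
    """
    # Linear part: theta_0 + theta_1*x_1 + ... + theta_p*x_p
    t = theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))
    # Sigmoid squeezes the linear output into the range 0 to 1
    # (not guarded against overflow for extremely negative t)
    return 1.0 / (1.0 + math.exp(-t))

theta = [-1.0, 0.8, 0.5]  # made-up coefficients, for illustration only
x = [2.0, 1.0]            # made-up feature values
print(round(hypothesis(theta, x), 3))
```

Here the linear part evaluates to 1.1, and the sigmoid turns it into a probability of roughly 0.75.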
Cost Function of Logistic Regression
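For a single example, the cost is −log(hθ(x)) if y = 1 and −log(1−hθ(x)) if y = 0. Combined into a single expression, the cost function of the logistic regression algorithm can be written as: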
Cost (hθ(x),y) = −y log(hθ(x)) − (1−y) log(1−hθ(x))
The intuition behind the cost function of logistic regression (log-loss) is that it measures how close the predicted probability is to the corresponding actual (true) value. The further the predicted probability is from the actual value, the higher the log-loss.
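The intuition can be checked with a small numerical sketch of the per-example cost given above. The function name `log_loss_single` and the probabilities are chosen for this example. Note that the predicted probability must lie strictly between 0 and 1, otherwise the logarithm is undefined.

```python
import math

def log_loss_single(h, y):
    """Cost for one example: -y*log(h) - (1-y)*log(1-h).

    h: predicted probability, strictly between 0 and 1
    y: actual label, 0 or 1
    """
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# Actual label is 1 in both cases
print(round(log_loss_single(0.9, 1), 3))  # prediction close to the truth: 0.105
print(round(log_loss_single(0.1, 1), 3))  # prediction far from the truth: 2.303
```

A confident but wrong prediction is penalised far more heavily than a confident and correct one.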
4. Assumptions of Logistic Regression
Let's look at a few assumptions to consider when building a logistic regression model:
- The dependent (or target) variable must be categorical in nature.
- The independent (or input) variables should not exhibit multicollinearity, which means they should not be highly correlated with each other.
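One rough way to spot multicollinearity is to look at the pairwise correlation between input variables. The sketch below computes the Pearson correlation by hand. The helper name `pearson` and the three feature columns are invented for illustration, and pairwise correlation is only a first check, not a complete test for multicollinearity.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up feature columns, for illustration only
visits = [1, 2, 3, 4, 5, 6]
clicks = [3, 5, 7, 9, 11, 14]    # almost a linear function of visits
age = [40, 23, 35, 51, 29, 44]

print(round(pearson(visits, clicks), 2))  # close to 1: strongly correlated
print(round(pearson(visits, age), 2))     # much weaker correlation
```

A correlation close to 1 or −1 between two inputs suggests that one of them adds little independent information, so it may be worth dropping or combining them.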
5. Video
6. Model Evaluation
Classification Model Evaluation
An exhaustive list of evaluation methods for classification machine learning models in data science. The article explains the confusion matrix, precision, recall, F1 score, specificity, AUC-ROC, etc. using examples, formulae, and Python code.
7. Python Code
1. Load important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
2. Load Data Set
path = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(path)
df.head()
3. Data Preparation – Create independent(X) and dependent (y) Variables
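# Copy the selected columns so the assignment below modifies X itself, not a view of df
X = df[['Pclass','SibSp','Parch','Sex']].copy()
y = df['Survived']
# Convert male/female to integers (male = 1, female = 0)
X["Sex"] = np.where(X['Sex'] == "male", 1, 0)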
4. Split data set into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
5. Train Logistic Regression model on train set
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
6. Predict and Evaluate Logistic Regression model on test set
y_pred = lr.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = accuracy_score(y_test,y_pred)
print("Accuracy on Test Set:",round(result,2))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Accuracy on Test Set: 0.78

Confusion Matrix:
[[139  21]
 [ 38  70]]

Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.87      0.82       160
           1       0.77      0.65      0.70       108

    accuracy                           0.78       268
   macro avg       0.78      0.76      0.76       268
weighted avg       0.78      0.78      0.78       268
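The numbers in the classification report can be reproduced by hand from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are actual classes, columns are predicted classes) and computes the metrics for class 1 (survived). The variable names are chosen for this example.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class
cm = [[139, 21],
      [38, 70]]
tn, fp = cm[0]  # actual 0: predicted 0 (true negatives), predicted 1 (false positives)
fn, tp = cm[1]  # actual 1: predicted 0 (false negatives), predicted 1 (true positives)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted survivors, how many actually survived
recall = tp / (tp + fn)      # of all actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.78 0.77 0.65 0.7
```

These match the accuracy and the class 1 row of the classification report.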
Learn more Data Science Algorithms here.