Logistic Regression

Logistic Regression is a supervised learning algorithm which is used when the dependent variable (or target) is categorical.

The dependent variable should be categorical or finite:

  • either A or B (binary logistic regression)
  • or, a range of finite options A, B, C or D (multinomial or ordinal logistic regression).

Table of Contents

  1. Overview
  2. Types of Logistic Regression
  3. How Logistic Regression Works
  4. Assumptions of Logistic Regression
  5. Video
  6. Model Evaluation
  7. Python Code

1. Overview

The Logistic Regression algorithm assumes a relationship between the dependent variable (y) and one or more independent variables (x), and predicts the output probability using the logistic regression equation.

Using logistic regression we can predict the likelihood of an event happening. For example,

  • We can estimate the likelihood of a visitor making a purchase on a website (a binary dependent variable). The independent (input) variables can be known characteristics of the visitor, such as the number of repeat visits to the site, the number of clicks, etc. A logistic regression model can help determine the probability that a visitor converts into a customer.
  • We can predict whether an email is spam (1) or not (0). A logistic regression model predicts the probability, between 0 and 1, of an email being spam: a probability closer to 0 means not spam, and a probability closer to 1 means spam. Classification is then done against a threshold value. Say, if the probability is above 0.5, the email is classified as spam, otherwise as not spam.
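The thresholding step in the spam example can be sketched in a few lines. The probabilities below are hypothetical values made up for illustration, and `classify` is a helper name introduced here, not part of any library.

```python
def classify(probability, threshold=0.5):
    """Return 1 (spam) if the probability is above the threshold, else 0 (not spam)."""
    return 1 if probability > threshold else 0

# Hypothetical predicted spam probabilities for four emails
probabilities = [0.05, 0.40, 0.65, 0.98]
labels = [classify(p) for p in probabilities]
print(labels)  # [0, 0, 1, 1]
```

Raising the threshold (for example to 0.8) makes the classifier more cautious about labelling an email as spam.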

2. Types of Logistic Regression

On the basis of the dependent variable output categories, Logistic Regression can be classified into three types:

  1. Binary or Binomial: In Binary Logistic regression, a dependent variable will have only two possible output types such as 1 or 0, pass or fail, yes or no, win or loss, etc.
  2. Multinomial: In multinomial Logistic regression, a dependent variable can have 3 or more possible types of unordered output (having no quantitative significance). For example, in iris flower classification the output is setosa, virginica or versicolor.
  3. Ordinal: In ordinal Logistic regression, a dependent variable can have 3 or more possible types of ordered output (having a quantitative significance). For example, the output can be one of 3 or more ordered types such as:
    • low, medium, or high
    • poor, good, very good, excellent
    • scores like 1, 2, 3, 4 or 5

3. How Logistic Regression Works

The logistic regression model uses the sigmoid (logistic) function to squeeze the output of a linear equation between 0 and 1. Let’s first learn more about the sigmoid function.

Sigmoid Function

  • The sigmoid function is a mathematical function that maps any real value to a value between 0 and 1, which can be read as a probability.
  • In a classification model the probability must lie between 0 and 1, so the sigmoid function maps any real value onto an “S”-shaped curve. This S-shaped curve is known as the sigmoid function or the logistic function.
  • In logistic regression, we use a threshold value to convert the probability into a decision value. In binary logistic regression, for example, the output should be 0 or 1. Suppose the threshold is 0.5: values above the threshold are converted to 1, and values below it are converted to 0, as in the earlier example of predicting whether an email is spam or not.
Figure: Sigmoid Function

The sigmoid function is y = 1 / (1 + e^(−t)). As t increases towards infinity, the predicted (output) value y tends to 1, and as t decreases towards negative infinity, y tends to 0.
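A minimal sketch of the sigmoid function, using only the standard library. The two-branch form is one common way to avoid floating-point overflow for very negative t; it is a choice made here for illustration.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)          # safe: t is negative, so exp(t) is at most 1
    return z / (1.0 + z)

# The output approaches 0 for very negative t and 1 for very positive t
for t in (-10, -1, 0, 1, 10):
    print(t, sigmoid(t))
```

At t = 0 the function returns exactly 0.5, which is why 0.5 is a natural default threshold.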

Logistic Regression Explanation

The logistic regression model uses this sigmoid function to squeeze the output of a linear regression equation between 0 and 1.

Let’s first write down the equation of the linear regression algorithm. It models the relationship between the hypothesis function (predicted output) and the input features (independent variables) using a linear equation as follows:

hθ(x⁽ⁱ⁾) = θ₀ + θ₁x₁⁽ⁱ⁾ + θ₂x₂⁽ⁱ⁾ + … + θₚxₚ⁽ⁱ⁾

For classification, we prefer probabilities between 0 and 1. So let’s wrap the right side of the above equation for hθ(x) in the sigmoid function, using the linear output as its input t. This forces the output to take values between 0 and 1 only.

The hypothesis function for Logistic Regression is:

hθ(x⁽ⁱ⁾) = 1 / (1 + e^(−t)),  where t = θ₀ + θ₁x₁⁽ⁱ⁾ + θ₂x₂⁽ⁱ⁾ + … + θₚxₚ⁽ⁱ⁾
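As a rough illustration, the hypothesis can be computed by passing the linear combination of the inputs through the sigmoid. The parameter values and the input below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def hypothesis(theta, x):
    """Sigmoid of theta_0 + theta_1*x_1 + ... + theta_p*x_p (a sketch for moderate values of t)."""
    t = theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-t))

theta = [-1.0, 0.8, 0.5]   # hypothetical theta_0, theta_1, theta_2
x = [2.0, 1.0]             # one sample with two features
p = hypothesis(theta, x)   # t = -1.0 + 0.8*2.0 + 0.5*1.0 = 1.1
print(round(p, 2))  # 0.75
```

With all parameters set to zero, t is 0 for every input and the predicted probability is 0.5.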

We cannot use linear regression’s Sum of Squared Errors cost here: because the hypothesis function in Logistic Regression is nonlinear, that cost becomes a non-convex function of θ (it can have local minima). Logistic Regression instead uses the log-loss cost, which gives a convex function with a single global minimum.

Figure: Convex and Non-Convex Functions (global and local minima)
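The convexity difference can be checked numerically. The sketch below assumes a single training example (x = 1, y = 1) and one parameter θ, then looks at discrete second differences of each cost: a negative value means the curve bends downward somewhere, so it is not convex. The function names are introduced here for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def squared_error(theta, x=1.0, y=1.0):
    return (sigmoid(theta * x) - y) ** 2

def log_loss(theta, x=1.0, y=1.0):
    h = sigmoid(theta * x)
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def second_differences(cost, thetas, step):
    """Discrete curvature of cost at each theta; negative means bending downward."""
    return [cost(t - step) - 2 * cost(t) + cost(t + step) for t in thetas]

step = 0.1
thetas = [i * step for i in range(-50, 51)]  # theta from -5 to 5
sq_curv = second_differences(squared_error, thetas, step)
ll_curv = second_differences(log_loss, thetas, step)
print("squared error bends downward somewhere:", min(sq_curv) < 0)  # True
print("log-loss bends downward somewhere:", min(ll_curv) < 0)       # False
```

This is only a one-example check, not a proof, but it shows why squared error combined with the sigmoid is avoided.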

Cost Function of Logistic Regression

Cost (hθ(x),y) = −log(hθ(x))       if y = 1
Cost (hθ(x),y) = −log(1−hθ(x))     if y = 0

Or, the cost function of the Logistic Regression algorithm can be written as a single expression:

Cost (hθ(x),y) = −y log(hθ(x)) − (1−y) log(1−hθ(x))

The intuition behind the cost function of Logistic Regression (or log-loss) is that it measures how close the predicted probability is to the corresponding actual (or true) value. The further the predicted probability is from the actual value, the higher the log-loss.
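To make the intuition concrete, the sketch below evaluates the cost for a single example whose true label is 1, at three hypothetical predicted probabilities.

```python
import math

def cost(h, y):
    """Log-loss for one example: -y*log(h) - (1-y)*log(1-h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# True label is 1; the cost grows as the predicted probability moves away from 1
for h in (0.9, 0.6, 0.1):
    print(f"predicted {h}: cost = {cost(h, 1):.3f}")
# predicted 0.9: cost = 0.105
# predicted 0.6: cost = 0.511
# predicted 0.1: cost = 2.303
```

A confident wrong prediction (0.1 when the truth is 1) is penalised far more heavily than a mildly uncertain one.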

4. Assumptions of Logistic Regression

Let’s understand a few assumptions to consider while building a logistic regression model:

  • The dependent (or target) variable must be categorical in nature.
  • The independent (or input) variables should not show multicollinearity, which means they should not be highly correlated with one another.
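A rough first check for multicollinearity is the pairwise correlation between input variables. The sketch below computes the Pearson correlation by hand on hypothetical toy features. Pairwise correlation is only a partial check and will not catch every form of multicollinearity.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical features: the two height columns carry nearly the same information
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
clicks = [3, 9, 1, 7, 4]

print("height_cm vs height_in:", pearson(height_cm, height_in))
print("height_cm vs clicks:", pearson(height_cm, clicks))
```

A coefficient close to 1 or −1 suggests that one of the two variables is redundant and could be dropped.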

5. Video

Alternate Video

6. Model Evaluation

Classification Model Evaluation

An exhaustive list of evaluation methods for classification machine learning models in data science. The article explains the confusion matrix, precision, recall, F1 score, specificity, AUC-ROC, etc. using examples, formulae, and Python code.

7. Python Code

1. Load important libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Load Data Set

path = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(path)
df.head()

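3. Data Preparation – Create independent (X) and dependent (y) Variables

# Select the input features; .copy() avoids pandas' SettingWithCopyWarning when X is modified below
X = df[['Pclass', 'SibSp', 'Parch', 'Sex']].copy()
y = df['Survived']

# Convert Sex to an integer: male -> 1, female -> 0
X['Sex'] = np.where(X['Sex'] == 'male', 1, 0)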

4. Split data set into train and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
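What a 70/30 split does can be sketched without scikit-learn: shuffle the row indices, then cut them into two groups. `split_indices` is a hypothetical helper, and the 891-row count is an assumption about this dataset. The test-set size is rounded up here so that it yields the 268 test rows seen in the report further down.

```python
import math
import random

def split_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # a fixed seed makes the split repeatable
    n_test = math.ceil(n_rows * test_size)    # round the test-set size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(891)      # assumed number of rows
print(len(train_idx), len(test_idx))  # 623 268
```

Because the call to train_test_split above passes no random_state, each run produces a different split, so the reported scores vary slightly from run to run.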

5. Train Logistic Regression model on train set

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train,y_train)
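Conceptually, fitting means finding the parameters θ that minimise the log-loss on the training data. The sketch below does this with plain batch gradient descent on hypothetical toy data. This is a swapped-in technique for illustration only: scikit-learn's LogisticRegression uses its own solvers and applies regularization by default, so it will not give identical numbers.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns [theta_0, theta_1, ...]."""
    n, p = len(X), len(X[0])
    theta = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for row, target in zip(X, y):
            h = sigmoid(theta[0] + sum(w * v for w, v in zip(theta[1:], row)))
            error = h - target                 # derivative of log-loss w.r.t. t
            grad[0] += error
            for j, v in enumerate(row):
                grad[j + 1] += error * v
        theta = [w - lr * g / n for w, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: one feature, label tends to be 1 when the feature is large
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
theta = fit(X, y)

def predict(x):
    return sigmoid(theta[0] + theta[1] * x)

print(predict(1.0) < 0.5, predict(4.0) > 0.5)  # True True
```

The learned slope is positive, so the predicted probability rises with the feature, and the 0.5 decision boundary sits between the low and high feature values.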

6. Predict and Evaluate Logistic Regression model on test set

y_pred = lr.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

result = accuracy_score(y_test,y_pred)
print("Accuracy on Test Set:",round(result,2))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Output (exact values vary between runs because the train/test split is random):

Accuracy on Test Set: 0.78

Confusion Matrix:
[[139  21]
 [ 38  70]]

Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.87      0.82       160
           1       0.77      0.65      0.70       108

    accuracy                           0.78       268
   macro avg       0.78      0.76      0.76       268
weighted avg       0.78      0.78      0.78       268
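The figures for class 1 in the report can be recomputed by hand from the confusion matrix printed above. In scikit-learn's layout, rows are actual classes and columns are predicted classes. The variable names below are introduced for illustration.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 139, 21
fn, tp = 38, 70

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted survivors, how many survived
recall = tp / (tp + fn)                      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.78 0.77 0.65 0.7
```

These match the accuracy row and the class 1 row of the classification report.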

© 2022 KeyToDataScience