Scikit-learn Logistic Regression: A Comprehensive Guide

Classification techniques are an essential component of machine learning and data mining applications, and a large share of practical data science problems are classification problems. Among the various classification algorithms available, logistic regression stands out as a simple and commonly used method, particularly useful for solving binary classification problems. Another category is multinomial classification, which handles problems where the target variable has more than two classes; the well-known Iris dataset is a classic example of multi-class classification. Logistic regression can be used for many classification tasks, such as spam detection. It is easy to implement, serves as a solid baseline for any binary classification problem, and its fundamental concepts also carry over to deep learning. This article provides a comprehensive guide to understanding and implementing logistic regression using scikit-learn.

What is Logistic Regression?

Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature, meaning there are only two possible classes. For example, it can be used for cancer detection problems. It is closely related to linear regression but handles a categorical target variable: it uses the log of odds as the dependent variable, modeling it as a linear function of the input features x1, x2, …, xm.

Linear Regression vs. Logistic Regression

While both are regression techniques, they differ significantly in their application and output. Linear regression gives you a continuous output, while logistic regression provides a discrete (categorical) output. Examples of continuous output are house prices and stock prices. Examples of discrete output are predicting whether a patient has cancer and predicting whether a customer will churn.

Maximum Likelihood Estimation (MLE) vs. Ordinary Least Squares (OLS)

MLE is a "likelihood"-maximization method, while OLS is a distance-minimizing approximation method. Maximizing the likelihood function determines the parameters that are most likely to produce the observed data. From a statistical point of view, MLE treats quantities such as the mean and variance as parameters to be estimated for a given model. Ordinary least squares estimates are computed by fitting a regression line through the given data points that has the minimum sum of squared deviations (least square error). Both are methods for estimating the parameters of a model; logistic regression is fitted with MLE, whereas linear regression is typically fitted with OLS.

Types of Logistic Regression

Logistic regression can be classified into three main types based on the nature of the dependent variable:


  • Binomial Logistic Regression: This type is used when the dependent variable has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic regression and is used for binary classification problems.

  • Multinomial Logistic Regression: This is used when the dependent variable has three or more possible categories that are not ordered. For example, classifying animals into categories like "cat," "dog" or "sheep." It extends the binary logistic regression to handle multiple classes.

  • Ordinal Logistic Regression: This type applies when the dependent variable has three or more categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high." It takes the order of the categories into account when modeling.
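In scikit-learn, the same LogisticRegression estimator covers both the binomial and the multinomial case. The sketch below illustrates this with the built-in iris dataset (three classes) and a two-class subset of it; the variable names are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Multinomial: the iris target has three classes.
multi = LogisticRegression(max_iter=1000).fit(X, y)
print(multi.classes_)   # three class labels

# Binomial: keep only classes 0 and 1.
mask = y < 2
binom = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
print(binom.classes_)   # two class labels
```

Ordinal logistic regression is not provided by scikit-learn's LogisticRegression and would need a dedicated library.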

Assumptions of Logistic Regression

Understanding the assumptions behind logistic regression is important to ensure the model is applied correctly. The main assumptions are:

  • Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.
  • Binary dependent variable: The model assumes the dependent variable is binary, i.e., it can take only two values. For more than two categories, the softmax function is used instead.
  • Linear relationship between independent variables and log odds: The model assumes a linear relationship between the independent variables and the log odds of the dependent variable, meaning the predictors affect the log odds in a linear way.
  • No outliers: The dataset should not contain extreme outliers as they can distort the estimation of the logistic regression coefficients.
  • Large sample size: It requires a sufficiently large sample size to produce reliable and stable results.

The Sigmoid Function

The sigmoid function, also called the logistic function, gives an ‘S’-shaped curve that can take any real-valued number and map it into a value between 0 and 1. As the input goes to positive infinity, the predicted probability approaches 1, and as it goes to negative infinity, the predicted probability approaches 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO.


Understanding the Sigmoid Function

  1. The sigmoid function is an important part of logistic regression; it is used to convert the raw output of the model into a probability value between 0 and 1.

  2. This function takes any real number and maps it into the range 0 to 1 forming an "S" shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is perfect for this purpose.

  3. In logistic regression, we use a threshold value usually 0.5 to decide the class label.

    • If the sigmoid output is equal to or above the threshold, the input is classified as Class 1.
    • If it is below the threshold, the input is classified as Class 0.

This approach helps to transform continuous input values into meaningful class predictions.
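The thresholding described above can be sketched in a few lines of NumPy; the sample inputs are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued z to the (0, 1) interval with an S-shaped curve."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(z)
print(probs)   # values approach 0 for large negative z and 1 for large positive z

# Threshold at 0.5 to turn probabilities into class labels.
labels = (probs >= 0.5).astype(int)
print(labels)  # [0 0 1 1 1]
```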

How Logistic Regression Works

The logistic regression model transforms the continuous output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. This function is known as the logistic function.


Suppose we have input features represented as a matrix:

X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ x_{21} & \cdots & x_{2m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}

and the dependent variable is Y, which takes only binary values, i.e., 0 or 1.

Y = \begin{cases} 0 & \text{if Class 1} \\ 1 & \text{if Class 2} \end{cases}

Then we apply the multi-linear function to the input variables X:

z = \left(\sum_{i=1}^{n} w_{i}x_{i}\right) + b

Here x_i is the i-th observation of X, w = [w_1, w_2, w_3, \cdots, w_m] is the vector of weights or coefficients, and b is the bias term, also known as the intercept. This can be written compactly as the dot product of the weights and the input plus the bias:

z = w\cdot X +b

At this stage, z is a continuous value from the linear regression. Logistic regression then applies the sigmoid function to z to convert it into a probability between 0 and 1 which can be used to predict the class.

Now we pass z through the sigmoid function to obtain a probability between 0 and 1; the sigmoid output \sigma(z) = \frac{1}{1+e^{-z}} is always bounded between 0 and 1,

where the probability of being a class can be measured as:

\begin{aligned} P(y=1) &= \sigma(z) \\ P(y=0) &= 1-\sigma(z) \end{aligned}
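A quick numeric sketch of these two probabilities, using made-up weights and a single input point (all values are illustrative):

```python
import numpy as np

# Hypothetical weights, bias, and one input point (illustration only).
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.5])

z = np.dot(w, x) + b           # linear combination: a continuous value
p1 = 1.0 / (1.0 + np.exp(-z))  # P(y = 1) = sigma(z)
p0 = 1.0 - p1                  # P(y = 0) = 1 - sigma(z)
print(z, p1, p0)               # the two probabilities sum to 1
```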

Logistic Regression Equation and Odds

It models the odds of the dependent event occurring which is the ratio of the probability of the event to the probability of it not occurring:

\frac{p(x)}{1-p(x)} = e^z

Taking the natural logarithm of the odds gives the log-odds or logit:

\begin{aligned}
\log \left[\frac{p(x)}{1-p(x)} \right] &= z \\
\log \left[\frac{p(x)}{1-p(x)} \right] &= w\cdot X + b \\
\frac{p(x)}{1-p(x)} &= e^{w\cdot X + b} \;\;\;\cdots\text{Exponentiate both sides} \\
p(x) &= e^{w\cdot X + b}\cdot (1-p(x)) \\
p(x) &= e^{w\cdot X + b} - e^{w\cdot X + b}\cdot p(x) \\
p(x) + e^{w\cdot X + b}\cdot p(x) &= e^{w\cdot X + b} \\
p(x)\left(1+e^{w\cdot X + b}\right) &= e^{w\cdot X + b} \\
p(x) &= \frac{e^{w\cdot X + b}}{1+e^{w\cdot X + b}}
\end{aligned}

then the final logistic regression equation will be:

p(X;b,w) = \frac{e^{w\cdot X + b}}{1+e^{w\cdot X + b}} = \frac{1}{1+e^{-(w\cdot X + b)}}

This formula represents the probability of the input belonging to Class 1.

Likelihood Function for Logistic Regression

The goal is to find weights w and bias b that maximize the likelihood of observing the data.

For each data point i:

  • for y = 1, the predicted probability is p(X;b,w) = p(x_i)
  • for y = 0, the predicted probability is 1 - p(X;b,w) = 1 - p(x_i)

The likelihood over all n data points is then:

L(b,w) = \prod_{i=1}^{n}p(x_i)^{y_i}(1-p(x_i))^{1-y_i}

Taking natural logs on both sides:

\begin{aligned}\log(L(b,w)) &= \sum_{i=1}^{n} y_i\log p(x_i) + (1-y_i)\log(1-p(x_i)) \\
&= \sum_{i=1}^{n} y_i\log p(x_i) + \log(1-p(x_i)) - y_i\log(1-p(x_i)) \\
&= \sum_{i=1}^{n} \log(1-p(x_i)) + \sum_{i=1}^{n} y_i\log \frac{p(x_i)}{1-p(x_i)} \\
&= \sum_{i=1}^{n} -\log\left(1+e^{w\cdot x_i + b}\right) + \sum_{i=1}^{n} y_i (w\cdot x_i + b)\end{aligned}

This is known as the log-likelihood function.
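As a sanity check, the simplified form derived above can be compared numerically with the direct definition of the log-likelihood; the data and parameters below are arbitrary illustrations:

```python
import numpy as np

# Toy data and hypothetical parameters (illustration only).
X = np.array([[0.5], [1.5], [-1.0], [2.0]])
y = np.array([0, 1, 0, 1])
w, b = np.array([1.2]), -0.5

z = X @ w + b
p = 1.0 / (1.0 + np.exp(-z))

# Direct form: sum of y*log(p) + (1 - y)*log(1 - p).
ll_direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simplified form from the derivation: sum of y*z - log(1 + e^z).
ll_simplified = np.sum(y * z - np.log(1 + np.exp(z)))

print(ll_direct, ll_simplified)   # the two forms agree
```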

Gradient of the log-likelihood function

To find the best w and b, we use gradient ascent on the log-likelihood function. The gradient with respect to each weight w_j is:

\begin{aligned}\frac{\partial l(b,w)}{\partial w_j} &= -\sum_{i=1}^{n}\frac{1}{1+e^{w\cdot x_i + b}}\,e^{w\cdot x_i + b}\, x_{ij} + \sum_{i=1}^{n}y_{i}x_{ij} \\
&= -\sum_{i=1}^{n}p(x_i;b,w)\,x_{ij} + \sum_{i=1}^{n}y_{i}x_{ij} \\
&= \sum_{i=1}^{n}\left(y_i - p(x_i;b,w)\right)x_{ij}\end{aligned}
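The gradient above can be turned into a simple gradient-ascent loop. This is a toy sketch on synthetic data, not a production solver; the learning rate, iteration count, and true parameters are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data generated from known weights (illustration only).
n, m = 500, 2
X = rng.normal(size=(n, m))
true_w, true_b = np.array([2.0, -1.0]), 0.5
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
y = (rng.uniform(size=n) < p_true).astype(float)

# Gradient ascent: step in the direction of sum_i (y_i - p(x_i)) * x_ij
# for each weight, and sum_i (y_i - p(x_i)) for the bias.
w, b, lr = np.zeros(m), 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += lr * (X.T @ (y - p)) / n
    b += lr * np.sum(y - p) / n

print(w, b)   # should land near the true parameters
```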

Terminologies Involved in Logistic Regression

Here are some common terms involved in logistic regression:

  • Independent Variables: These are the input features or predictor variables used to make predictions about the dependent variable.
  • Dependent Variable: This is the target variable that we aim to predict. In logistic regression, the dependent variable is categorical.
  • Logistic Function: This function transforms the independent variables into a probability between 0 and 1 which represents the likelihood that the dependent variable is either 0 or 1.
  • Odds: This is the ratio of the probability of an event happening to the probability of it not happening. It differs from probability because probability is the ratio of occurrences to total possibilities.
  • Log-Odds (Logit): The natural logarithm of the odds. In logistic regression, the log-odds are modeled as a linear combination of the independent variables and the intercept.
  • Coefficient: These are the parameters estimated by the logistic regression model which shows how strongly the independent variables affect the dependent variable.
  • Intercept: The constant term in the logistic regression model which represents the log-odds when all independent variables are equal to zero.
  • Maximum Likelihood Estimation (MLE): This method is used to estimate the coefficients of the logistic regression model by maximizing the likelihood of observing the given data.
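After fitting, scikit-learn exposes the estimated coefficients and intercept as the coef_ and intercept_ attributes. A minimal sketch on synthetic data (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.coef_)        # coefficients: one weight per feature
print(model.intercept_)   # intercept: the log-odds when all features are zero
```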

Implementing Logistic Regression with Scikit-learn

Scikit-learn provides a straightforward way to implement logistic regression using its 4-step modeling pattern. This pattern is consistent across various machine learning models in scikit-learn, making it easy to learn and apply.

Scikit-learn 4-Step Modeling Pattern

  1. Import the model: Import the LogisticRegression class from the sklearn.linear_model module.

    from sklearn.linear_model import LogisticRegression
  2. Create an instance of the model: Instantiate the LogisticRegression class. You can specify hyperparameters here, or use the default values.

    logisticRegr = LogisticRegression()
  3. Train the model: Fit the model to your training data using the fit() method. This step involves the model learning the relationship between the features (X_train) and the target variable (y_train).

    logisticRegr.fit(X_train, y_train)
  4. Predict on new data: Use the trained model to predict the target variable for new, unseen data using the predict() method.

    predictions = logisticRegr.predict(X_test)

Example: Digits Dataset

Let's illustrate this with the digits dataset, a built-in dataset in scikit-learn containing images of handwritten digits.

  1. Load the dataset:

    from sklearn.datasets import load_digits

    digits = load_digits()
    print("Image Data Shape", digits.data.shape)
    print("Label Data Shape", digits.target.shape)
  2. Split the data into training and testing sets:
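A common way to perform this step uses scikit-learn's train_test_split; the 75/25 split and random_state below are assumptions, not from the original:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out 25% of the images for testing; random_state makes the
# split reproducible (both values are illustrative choices).
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)
```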
