Logistic Regression with Scikit-Learn
Introduction
Logistic regression is a machine learning model that predicts the probability of an outcome (called a 'class') occurring.
In this tutorial, we will use Scikit-Learn and its logistic regression primitives to predict the likelihood of a good night's sleep based on the number of sleep hours and awakenings.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

plt.rcParams['figure.figsize'] = [8, 7]
plt.rcParams['figure.dpi'] = 100
Logistic Regression Model
A logistic regression is a kind of model that, unlike regular linear regression, predicts the probability of a certain class (e.g., raining) being true, as a number between 0 and 1:
X -> y where 1.0 >= y >= 0.0
In a typical binary (binomial) classification, each observation is labelled as true or false, as opposed to being given a continuous value. The prediction ranges from absolutely true (1.0) to absolutely false (0.0).
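Under the hood, the model computes a weighted sum of the features and squashes it into the (0, 1) range with the logistic (sigmoid) function. As a minimal sketch of that squashing step (the sigmoid helper below is our own illustration, not a Scikit-Learn function):

def sigmoid(z):
    # map any real-valued score onto the (0, 1) probability range
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(4.0))   # a large positive score maps close to 1.0 (~0.982)
print(sigmoid(-4.0))  # a large negative score maps close to 0.0 (~0.018)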
Let's put this into action. Consider a data set in which patients who suffer from sleep problems report the number of hours they slept, how many times they woke up during the night, and how well they feel in the morning, as 'slept badly' (False) or 'slept well' (True):
np.random.seed(3)
sleep_hours = np.linspace(1, 9, 100)
awakenings = np.random.randint(1, 6, size=1001)
sleep_q = [6 + h + (np.random.rand() * 3) - a
           for (h, a) in zip(sleep_hours, awakenings)]
X = np.array([[h, a] for (h, a) in zip(sleep_hours, awakenings)])
y = [q >= 12 for q in sleep_q]

good = [(h, a) for (h, a, label) in zip(sleep_hours, awakenings, y) if label]
bad = [(h, a) for (h, a, label) in zip(sleep_hours, awakenings, y) if not label]

plt.scatter([t[0] for t in good], [t[1] for t in good], color='b', label='Slept Well')
plt.scatter([t[0] for t in bad], [t[1] for t in bad], color='r', label='Slept Badly')
plt.legend(loc='upper left')
plt.xlabel('Sleep Hours')
plt.ylabel('Awakenings')
plt.yticks([1, 2, 3, 4, 5])
plt.show()
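As an aside, MinMaxScaler is imported above but not used in the rest of this tutorial; both features already sit on comparable scales. If they did not, the features could be rescaled to the [0, 1] range before fitting, roughly like this (a sketch, not a step from the walkthrough below):

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column rescaled to the [0, 1] range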
In this example, we won't split the data set into training and test sets, but train the model on the entire data set:
model = LogisticRegression().fit(X, y)
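Had we wanted a held-out evaluation instead, the train_test_split helper imported earlier could be used along these lines (a sketch; the model_split name is ours and not part of the original walkthrough):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)
model_split = LogisticRegression().fit(X_train, y_train)
print(model_split.score(X_test, y_test))  # accuracy on the unseen test portion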
That's it. We can now interrogate the model. The regular predict() method returns the class, in this case a bool answer, as opposed to a float value. We use predict_proba(), instead, to obtain the actual probability for each class (False and True, in our case).
Example 1: Patient slept 3.2 hours, and woke up 0 times. Was his sleep good?
model.predict([[3.2, 0]])
array([False])
model.predict_proba([[3.2, 0]])
array([[0.89892622, 0.10107378]])
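These probabilities come straight from the learned linear model: the fitted weights are exposed as model.coef_ and model.intercept_, and pushing the weighted sum through the sigmoid reproduces the 'True' column. A quick sanity check (assuming the model was fitted as above):

z = np.dot(model.coef_[0], [3.2, 0]) + model.intercept_[0]
print(1.0 / (1.0 + np.exp(-z)))  # should match the second value above, ~0.101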
Example 2: Patient slept 8 hours, but woke up 10 times. Was her sleep good?
model.predict([[8.0, 10]])
array([False])
model.predict_proba([[8.0, 10]])
array([[9.99984809e-01, 1.51910314e-05]])
Example 3: Patient slept 10 hours, and woke up 0 times. Was her sleep good?
model.predict([[10.0, 0]])
array([ True])
model.predict_proba([[10.0, 0]])
array([[2.42600004e-04, 9.99757400e-01]])
In this data set, albeit synthetic, it is interesting to note how the number of awakenings influences the chances of a good night's sleep. This is where the probability value produced by predict_proba() shines.
def probability(awakenings):
    # probability of sleeping well as a function of sleep hours, for a fixed number of awakenings
    curve = [[h, model.predict_proba([[h, awakenings]])[0][1]]
             for h in np.linspace(0, 12, 12)]
    plt.plot([t[0] for t in curve], [t[1] for t in curve])
    if awakenings >= 3:
        plt.xlabel("Sleep hours")
    if awakenings in [0, 3]:
        plt.ylabel("Sleep well probability")
    plt.yticks([0.0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.xticks([0, 2, 4, 6, 8, 10, 12])
    plt.title("# Awakenings = {}".format(awakenings))

for x in range(0, 6):
    plt.subplot(2, 3, x + 1)
    probability(x)
Regularisation
A logistic regression uses a linear model which can be regularised, just like regular regression models that produce continuous values. Let us first take a look at the class boundary for the model we have been working on, without any added penalties.
def show_class_boundary(models):
    plt.scatter([t[0] for t in good], [t[1] for t in good], color='b', label='Slept Well')
    plt.scatter([t[0] for t in bad], [t[1] for t in bad], color='r', label='Slept Badly')
    plt.xlabel('Sleep Hours')
    plt.ylabel('Awakenings')
    plt.yticks([1, 2, 3, 4, 5])
    for i, m in enumerate(models):
        # grid points whose predicted probability is (almost exactly) 0.5 trace the class boundary
        class_boundary = [(h, a)
                          for h in np.linspace(0, 10, 200)
                          for a in np.linspace(0, 6, 200)
                          if abs(m.predict_proba([[h, a]])[0][1] - 0.5) <= 0.001]
        plt.plot([t[0] for t in class_boundary], [t[1] for t in class_boundary],
                 label="Class Boundary {}".format(i + 1))
    plt.legend(loc='upper left')

show_class_boundary([model])
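Note that the boundary search above calls predict_proba() once per grid point, which is slow. A faster sketch with the same result evaluates the whole grid in a single call and lets Matplotlib trace the 0.5-probability contour (the xx, yy and grid names are ours):

xx, yy = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 6, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
probs = model.predict_proba(grid)[:, 1].reshape(xx.shape)
plt.contour(xx, yy, probs, levels=[0.5])  # the 0.5 contour is the class boundary
plt.show()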
In the class-boundary plot produced by show_class_boundary(), we can see that the boundary line covers at least one red dot. In a real-world scenario this model would be almost perfect, since the underlying relationship in our synthetic data set is linear. But let us suppose that we want this logistic regression to behave in such a way that, whenever the prediction is true (the patient has slept well), there is virtually no chance of a false positive.
A way to accomplish this is to strengthen the regularisation. The penalty is L2 by default (L1 can be selected too, via the penalty argument), and its strength is controlled with the C argument, which is the inverse of the regularisation strength: smaller values mean stronger regularisation. In the example below we set C=0.2, which has the effect of 'pushing' the class boundary to the right so that it no longer covers any red dots.
model2 = LogisticRegression(C=0.2).fit(X, y)
show_class_boundary([model, model2])
The regularisation we have added does not alter the score, which is already close to 100%:
display(model.score(X, y))
display(model2.score(X, y))
0.96
0.96
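As noted above, L2 is the default penalty. If an L1 penalty is preferred instead (it can drive some coefficients exactly to zero), a solver that supports it has to be chosen explicitly; for example (a sketch, not part of the original article):

model_l1 = LogisticRegression(penalty='l1', C=0.2, solver='liblinear').fit(X, y)
print(model_l1.score(X, y))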
Conclusion
Logistic regression is one of the simplest classification models, and it is most useful when the separation between the classes is roughly linear. In future tutorials, we will explore how to treat classes whose arrangement is more arbitrary.
Source: https://garba.org/posts/2022/logistic_regression/