Lesson 5: Logistic Regression with Scikit-learn 🎯
Objectives:
- Understand the basic concept of Logistic Regression and its suitability for classification.
- Implement Logistic Regression using Scikit-learn on the Titanic dataset.
- Evaluate the model's predictions using common classification metrics.
- Appreciate the ease of use of Scikit-learn for standard models.
Recap of Lessons 3 & 4:
- Data was preprocessed: missing values handled, categorical features one-hot encoded, numerical features scaled.
- Data was split into X_train, y_train, X_val, y_val. These are Pandas DataFrames/Series.
We've already built Logistic Regression from scratch to understand its mechanics; now we'll see the library version.
1. What is Logistic Regression? (Conceptual Recap - 10 min)
Logistic Regression is a statistical model used for binary classification problems (predicting one of two outcomes, e.g., Survived/Died). Despite "Regression" in its name, it's a classification algorithm.
- Core Idea: It models the probability that an input belongs to a particular class.
- Sigmoid (Logistic) Function: It uses the sigmoid function to squash the output of a linear equation (similar to linear regression: z = W·X + b) into a probability between 0 and 1:
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
The output σ(z) is P(y=1 | X; W, b).
- Decision Boundary: A threshold (typically 0.5) is applied to this probability to make a class prediction: if P(y=1) ≥ 0.5, predict class 1; if P(y=1) < 0.5, predict class 0 (see the sketch below).
- Suitability: It's well-suited for binary classification and provides probability estimates, which can be very useful. It assumes a linear relationship between the features and the log-odds of the outcome.
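Before moving to the library version, here is a minimal NumPy sketch of these two ideas, the sigmoid squashing and the 0.5 threshold. The sample values and weights below are made up purely for illustration.

import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example: one sample with 3 features and made-up weights
X = np.array([0.5, -1.2, 2.0])
W = np.array([0.8, 0.3, -0.5])
b = 0.1

z = np.dot(W, X) + b                  # linear part: z = W·X + b
p_survived = sigmoid(z)               # P(y=1 | X; W, b)
prediction = int(p_survived >= 0.5)   # decision boundary at 0.5
print(f"P(y=1) = {p_survived:.3f} -> predicted class {prediction}")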
2. Why Use Scikit-learn's Logistic Regression? (Conceptual - 5 min)
While building from scratch (Lesson 4) is excellent for understanding, Scikit-learn offers:
- Optimization: Highly optimized algorithms (solvers like 'liblinear', 'lbfgs', 'saga') for faster training.
- Features: Built-in support for regularization (L1, L2), handling multi-class problems, and more (see the sketch after this list).
- Ease of Use: A consistent API for training, predicting, and evaluating models.
- Robustness: Well-tested and widely used.
This lesson demonstrates how quickly we can implement and test a standard model using such a library.
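As a quick illustration of the regularization and consistent-API points above (a sketch, not part of the main lesson pipeline; the fit/predict calls are commented out because the training data is only set up in Section 3):

# Sketch: L1-regularized logistic regression using the 'saga' solver.
# C is the inverse regularization strength, so smaller C = stronger regularization.
from sklearn.linear_model import LogisticRegression

l1_model = LogisticRegression(penalty='l1', solver='saga', C=0.5, max_iter=1000)
# Same estimator API as any other Scikit-learn model:
# l1_model.fit(X_train, y_train)
# y_pred = l1_model.predict(X_val)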
3. Implementation with Scikit-learn (Practical - 20 min)
We'll use the LogisticRegression model from sklearn.linear_model.
# Ensure necessary libraries and data are loaded from previous lessons
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression  # note: LogisticRegression, not LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
# import matplotlib.pyplot as plt # For potential plotting if needed
# --- Assume X_train, y_train, X_val, y_val are pre-existing Pandas DataFrames/Series ---
# For this script to be runnable standalone, let's create dummy preprocessed data
# In your actual workflow, you'd use the data from your previous lessons.
if 'X_train' not in locals() or 'y_train' not in locals():
    print("--- X_train, y_train not found. Creating dummy preprocessed data for Lesson 5. ---")
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    num_samples = 891
    num_features = 8
    dummy_X_data = np.random.rand(num_samples, num_features)
    feature_names = [f'feature_{i}' for i in range(num_features)]
    X_full = pd.DataFrame(dummy_X_data, columns=feature_names)
    scaler = StandardScaler()
    X_scaled_data = scaler.fit_transform(X_full)
    X_full_scaled = pd.DataFrame(X_scaled_data, columns=feature_names)
    y_full = pd.Series(np.random.randint(0, 2, num_samples), name='Survived')
    X_train, X_val, y_train, y_val = train_test_split(
        X_full_scaled, y_full, test_size=0.2, random_state=42, stratify=y_full
    )
    print(f"Dummy X_train shape: {X_train.shape}, Dummy y_train shape: {y_train.shape}")
# 1. Initialize the Logistic Regression model
# Common parameters:
# solver: Algorithm to use in the optimization problem. Default is 'lbfgs'.
# 'liblinear' is good for small datasets. 'saga' supports L1 and L2.
# C: Inverse of regularization strength; smaller values specify stronger regularization. Default is 1.0.
# max_iter: Maximum number of iterations taken for the solvers to converge.
log_reg_model = LogisticRegression(solver='liblinear', random_state=42, max_iter=200)
# 2. Train the model
# Scikit-learn's LogisticRegression expects y_train to be a 1D array or Series.
print("\n--- Training Logistic Regression Model (Scikit-learn) ---")
log_reg_model.fit(X_train, y_train)
print("Logistic Regression model trained.")
# You can inspect the learned coefficients (weights) and intercept if interested
# print(f"Coefficients (W): {log_reg_model.coef_}")
# print(f"Intercept (b): {log_reg_model.intercept_}")
# 3. Make predictions
# .predict() gives class labels directly (0 or 1)
y_pred_class_train_logreg = log_reg_model.predict(X_train)
y_pred_class_val_logreg = log_reg_model.predict(X_val)
# .predict_proba() gives probabilities for each class [P(class=0), P(class=1)]
y_pred_proba_val_logreg = log_reg_model.predict_proba(X_val)
print("\nFirst 5 class predictions (0 or 1) on validation set:")
print(y_pred_class_val_logreg[:5])
print("\nFirst 5 probability predictions [P(Died), P(Survived)] on validation set:")
print(y_pred_proba_val_logreg[:5])
# We are interested in P(Survived), which is the second column
print("\nFirst 5 probabilities of 'Survived' on validation set:")
print(y_pred_proba_val_logreg[:5, 1])
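A small aside before evaluating (a sketch built on the variables above): the column order of predict_proba follows log_reg_model.classes_, and .predict() is effectively a 0.5 threshold on the positive-class probability, so you can move that threshold yourself to trade precision against recall.

# Column order of predict_proba matches the classes_ attribute (here [0, 1])
print(f"\nClass order: {log_reg_model.classes_}")
# Reproduce .predict() by thresholding P(Survived) manually; raising the
# threshold yields fewer, but more confident, 'Survived' predictions.
threshold = 0.5
y_pred_manual = (y_pred_proba_val_logreg[:, 1] >= threshold).astype(int)
print(f"Manual 0.5 threshold matches .predict(): {np.array_equal(y_pred_manual, y_pred_class_val_logreg)}")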
4. Evaluation (Practical - 15 min)
We use standard classification metrics.
# (Continuing from the previous code block)
print("\n--- Logistic Regression Model Evaluation (Scikit-learn) ---")
# Accuracy
train_accuracy_logreg = accuracy_score(y_train, y_pred_class_train_logreg)
val_accuracy_logreg = accuracy_score(y_val, y_pred_class_val_logreg)
print(f"Training Accuracy: {train_accuracy_logreg*100:.2f}%")
print(f"Validation Accuracy: {val_accuracy_logreg*100:.2f}%")
# Confusion Matrix for Validation Set
print("\nValidation Set Confusion Matrix:")
cm_logreg = confusion_matrix(y_val, y_pred_class_val_logreg)
print(cm_logreg)
# For better display:
TN = cm_logreg[0,0]; FP = cm_logreg[0,1]
FN = cm_logreg[1,0]; TP = cm_logreg[1,1]
print(f"TN: {TN}, FP: {FP}, FN: {FN}, TP: {TP}")
# Precision, Recall, F1-Score for Validation Set
if len(np.unique(y_val)) > 1:  # Ensure both classes are present in the true labels
    precision_logreg = precision_score(y_val, y_pred_class_val_logreg, zero_division=0)
    recall_logreg = recall_score(y_val, y_pred_class_val_logreg, zero_division=0)
    f1_logreg = f1_score(y_val, y_pred_class_val_logreg, zero_division=0)
    print(f"Validation Precision: {precision_logreg:.4f}")
    print(f"Validation Recall: {recall_logreg:.4f}")
    print(f"Validation F1-Score: {f1_logreg:.4f}")
else:
    print("Could not calculate Precision/Recall/F1 for validation set (e.g., only one class in true labels).")
# (Optional) ROC Curve and AUC Score - Good for evaluating probability-based classifiers
# from sklearn.metrics import roc_curve, roc_auc_score
# y_pred_proba_survived_val = y_pred_proba_val_logreg[:, 1] # Probabilities for the positive class (Survived)
# fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba_survived_val)
# auc_score = roc_auc_score(y_val, y_pred_proba_survived_val)
# print(f"Validation AUC Score: {auc_score:.4f}")
# plt.figure(figsize=(8,6))
# plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc_score:.2f})")
# plt.plot([0, 1], [0, 1], 'k--') # Random guessing line
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('ROC Curve')
# plt.legend()
# plt.show()
5. Discussion (Conceptual - 5 min)
- Performance: Compare the Scikit-learn Logistic Regression's performance (accuracy, F1-score) to the from-scratch version (Lesson 4). The Scikit-learn version is likely to be similar or slightly better due to optimized solvers and default L2 regularization.
- Ease of Use: Note how few lines of code are needed to train and predict with Scikit-learn.
- Interpretability: Logistic Regression coefficients can be interpreted in terms of log-odds, providing some insight into feature importance (though less direct than linear regression coefficients for continuous outcomes); see the sketch after this list.
- Baseline Model: Logistic Regression is often a good baseline model for classification tasks. If more complex models (like neural networks) don't significantly outperform it, the simpler model might be preferred.
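To make the interpretability point concrete, here is a minimal sketch (using the trained log_reg_model and the X_train DataFrame from Section 3; with the dummy data, the names will just be the generic feature_i placeholders): exponentiating a coefficient turns a log-odds effect into an odds ratio, the multiplicative change in the odds of survival per one-unit increase in that scaled feature.

# Convert log-odds coefficients into odds ratios for easier interpretation
odds_ratios = pd.Series(np.exp(log_reg_model.coef_[0]), index=X_train.columns)
print("Odds ratio per one-unit increase in each (scaled) feature:")
print(odds_ratios.sort_values(ascending=False))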
Wrap-up & Next Steps:
- Recap: We've implemented Logistic Regression using Scikit-learn, a powerful and convenient library. We trained the model, made predictions (both class labels and probabilities), and evaluated its performance using various classification metrics.
- Teaser for Lesson 6 (formerly Lesson 5): Now that we have experience with a robust library-based classifier (Logistic Regression) and a from-scratch implementation, we'll transition to understanding the Core Concepts of Neural Networks. This will prepare us to build more complex and potentially more powerful models that can capture non-linear relationships in data.