In this post, we’ll explore linear regression in detail by working with the Possum Regression Dataset from Kaggle. Specifically, we’ll predict the length of a possum’s head (the response variable) from its total body length (the predictor variable). We’ll use Python’s Scikit-learn library for the computations and matplotlib for the visualizations.
Linear regression is one of the most fundamental techniques in statistics and machine learning. It is used to model and understand the relationship between a dependent variable (y) and one or more independent variables (x). In its simplest form, simple linear regression uses a single independent variable to predict the dependent variable.
Mathematically, it’s represented as:

y = w * x + b

Where:

- y is the predicted value (the dependent variable)
- x is the independent variable
- w is the slope (often called the weight)
- b is the intercept (often called the bias)
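For example, with purely illustrative values w = 0.5 and b = 40, an input of x = 90 gives the prediction y = 0.5 * 90 + 40 = 85.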
When multiple independent variables are used, the method is called multiple linear regression.
Understanding linear regression is crucial: it is easy to interpret, fast to fit, and forms the foundation for many more advanced modeling techniques.
The Possum Regression Dataset from Kaggle contains detailed measurements of possums from different regions of Australia and New Guinea, including head length, body length, tail length, foot length, and sex. For this exercise, we’ll focus on two specific columns:

- totlngth: total body length, in centimeters (our predictor)
- hdlngth: head length, in millimeters (our response)
Our goal is to predict the head length of a possum based on its total body length. This relationship is useful because body length is often easier to measure in field studies, and if a strong linear relationship exists, head length can be estimated accurately without direct measurement.
What We’re Going to Do
By the end of this exercise, you’ll gain a deeper understanding of how linear regression works, how to implement it using Python, and how to interpret its results in a practical context. This hands-on example demonstrates how even simple models like linear regression can extract valuable insights from data.
First, let’s load the dataset and visualize the relationship between body length and head length.
import kagglehub
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
path = kagglehub.dataset_download("abrambeyer/openintro-possum")
data = pd.read_csv(f"{path}/possum.csv", delimiter=",")
# Extract relevant columns
x = data['totlngth'] # Total body length (cm)
y = data['hdlngth'] # Head length (mm)
# Scatter plot of the data
plt.scatter(x, y, alpha=0.7, label='Data points')
plt.title("Possum Body Length vs. Head Length")
plt.xlabel("Body Length (cm)")
plt.ylabel("Head Length (mm)")
plt.legend()
plt.show()
Resulting Plot:
This scatter plot shows how body length and head length are related. We observe a roughly linear trend, which makes linear regression a suitable modeling choice.
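We can back up this visual impression with a quick numeric check. The Pearson correlation measures how close the relationship is to perfectly linear (values near ±1 indicate a strong linear association):

# Pearson correlation between body length and head length
print(f"Correlation: {x.corr(y):.2f}")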
To illustrate the concept, we start with an arbitrary guess of w = 1 and b = 1. This line likely won’t fit the data well but gives us a starting point.
# Plot data points and initial guess
w_init, b_init = 1, 1
y_pred_init = w_init * x + b_init
plt.scatter(x, y, alpha=0.7, label='Data points')
plt.plot(x, y_pred_init, color='red', label='Initial Line (w=1, b=1)')
plt.title("Initial Guess for Regression Line")
plt.xlabel("Body Length (cm)")
plt.ylabel("Head Length (mm)")
plt.legend()
plt.show()
Resulting Plot:
The red line represents our initial guess, which clearly doesn’t match the data well. The error (the vertical distance between each actual point and the line) is large.
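To put a number on that error, we can compute the mean squared error of the initial line (this uses the MSE definition introduced formally in the next section):

# Mean squared error of the initial guess (w=1, b=1)
mse_init = ((y - y_pred_init) ** 2).mean()
print(f"MSE of initial line: {mse_init:.2f}")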
Before diving into the code, let’s understand gradient descent, a fundamental optimization algorithm used in machine learning to minimize errors.
What is Gradient Descent?
Gradient descent is an iterative method for finding the values of the parameters (w and b in this case) that minimize a loss function. Here, the loss function is the Mean Squared Error (MSE):

MSE = (1/n) * Σ (y_i − (w * x_i + b))²

where the sum runs over all n data points, y_i is the actual head length, and w * x_i + b is the predicted one.
The goal is to adjust w (slope) and b (intercept) to make the predicted y as close as possible to the actual y values in the dataset.
Gradient descent achieves this by:
1. Calculating the gradient (partial derivatives) of the loss function with respect to w and b.
2. Updating w and b in the direction that reduces the error:

w = w − alpha * ∂MSE/∂w
b = b − alpha * ∂MSE/∂b

Here, alpha is the learning rate, which determines the step size for updates.
3. Repeating this process until the loss converges to a minimum or the parameters stabilize.
Why Gradient Descent Works
Imagine the loss function as a valley, with the lowest point representing the optimal values of w and b . Gradient descent starts at a random point on this “valley” and iteratively moves downward by following the slope (gradient). Over time, it reaches the bottom, which corresponds to the best-fit line for the data.
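To make the update rules concrete, here is a minimal NumPy sketch of gradient descent for our possum data. This is purely illustrative (it is not what Scikit-learn runs internally), and the centering of x is a practical tweak we add because the raw lengths make convergence extremely slow otherwise.

import numpy as np

def gradient_descent(x, y, alpha=0.01, epochs=5000):
    # Center x so the updates for w and b are nearly independent;
    # without this, converging on b would take millions of iterations.
    x_mean = x.mean()
    x_c = x - x_mean
    w, b = 1.0, 1.0  # the same arbitrary starting guess as before
    n = len(x)
    for _ in range(epochs):
        error = (w * x_c + b) - y           # per-point prediction error
        dw = (2 / n) * np.sum(error * x_c)  # partial derivative of MSE w.r.t. w
        db = (2 / n) * np.sum(error)        # partial derivative of MSE w.r.t. b
        w -= alpha * dw
        b -= alpha * db
    # Undo the centering: y = w * (x - x_mean) + b, so the intercept is b - w * x_mean
    return w, b - w * x_mean

w_gd, b_gd = gradient_descent(x.values, y.values)
print(f"Gradient descent estimate: w = {w_gd:.3f}, b = {b_gd:.3f}")

Because the MSE surface is a convex bowl, these iterates should settle close to the same w and b that Scikit-learn reports in the next section.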
Importance of the Learning Rate (alpha)

If alpha is too small, the updates are tiny and convergence is slow; if it is too large, the updates can overshoot the minimum and the loss may oscillate or even diverge. Picking a sensible alpha is therefore a key part of making gradient descent work in practice.

With gradient descent, we can systematically find the best-fit line for our data, reducing the error and making accurate predictions. Next, we’ll see how Scikit-learn handles this optimization for us.
Let’s use Scikit-learn to find the optimal values of w and b. (Under the hood, LinearRegression solves the least-squares problem in closed form rather than running gradient descent, but it minimizes the same MSE and arrives at the same line.)
from sklearn.linear_model import LinearRegression
import numpy as np
# Reshape x for sklearn (expects 2D array)
x_reshaped = x.values.reshape(-1, 1)
# Train the model
model = LinearRegression()
model.fit(x_reshaped, y)
# Get optimized parameters
w_opt = model.coef_[0]
b_opt = model.intercept_
print(f"Optimized parameters: w = {w_opt:.3f}, b = {b_opt:.3f}")
# Optimized parameters: w = 0.573, b = 42.710
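The fitted model can now make predictions directly. For example, for a hypothetical possum with an 85 cm body:

# Predict the head length for a hypothetical 85 cm possum
predicted_head = model.predict(np.array([[85.0]]))
print(f"Predicted head length: {predicted_head[0]:.1f} mm")
# Sanity check by hand: 0.573 * 85 + 42.710 ≈ 91.4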
Finally, we plot the regression line using the optimized w and b.
# Predict using the optimized line
y_pred_opt = w_opt * x + b_opt
# Plot data points and optimized line
plt.scatter(x, y, alpha=0.7, label='Data points')
plt.plot(x, y_pred_opt, color='green', label='Optimized Regression Line')
plt.title("Optimized Regression Line")
plt.xlabel("Body Length (cm)")
plt.ylabel("Head Length (mm)")
plt.legend()
plt.show()
Resulting Plot:
The green line represents the optimized regression line. This line fits the data well, minimizing the error and capturing the relationship between body length and head length.
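To go beyond an eyeball judgment of "fits the data well", we can ask the model for its R² score, the fraction of the variance in head length that body length explains:

# R² of the fitted line on the training data
print(f"R² = {model.score(x_reshaped, y):.3f}")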
We successfully modeled the relationship between a possum’s body length and head length using linear regression. Along the way, we visualized the data, made an initial guess, saw how gradient descent refines the parameters, and let Scikit-learn find the optimal line. Linear regression is just the beginning; more sophisticated models can capture more complex relationships, but this foundational understanding is key.