1. Background
A simple and familiar problem: a linear regression with a single feature.
Simple linear regression model: y = b + w·x + ε, where b is the bias (intercept), w is the weight (slope), and ε is Gaussian noise.
2. Import Libraries and Make Preparations
- Libraries we need in the demo
import numpy as np
from sklearn.linear_model import LinearRegression
import torch
import torch.optim as optim
import torch.nn as nn
from torchviz import make_dot
import matplotlib.pyplot as plt
- Make preparations: a helper function to plot the train and validation data
# Helper function to plot the train and validation data side by side
def figure1(x_train, y_train, x_val, y_val):
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    ax[0].scatter(x_train, y_train)
    ax[0].set_xlabel('x')
    ax[0].set_ylabel('y')
    ax[0].set_ylim([0, 3.1])
    ax[0].set_title('Generated Data - Train')
    ax[1].scatter(x_val, y_val, c='r')
    ax[1].set_xlabel('x')
    ax[1].set_ylabel('y')
    ax[1].set_ylim([0, 3.1])
    ax[1].set_title('Generated Data - Validation')
    fig.tight_layout()
    return fig, ax
3. Data Generation
- 3-1) Let's start by generating some synthetic data
We start with a vector of 100 (N) points for our feature (x) and create our labels (y) using b = 1, w = 2, and some Gaussian noise (epsilon).
# Synthetic Data Generation
true_b = 1
true_w = 2
N = 100
# Data Generation
np.random.seed(42)
x = np.random.rand(N, 1)
epsilon = (.1 * np.random.randn(N, 1))
y = true_b + true_w * x + epsilon
- 3-2) Split data into train and validation sets
Next, let’s split our synthetic data into train and validation sets, shuffling the array of indices and using the first 80 shuffled points for training.
# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)
# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]
# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
# Uses the plotting helper to draw the train and validation data
figure1(x_train, y_train, x_val, y_val)
Result: two side-by-side scatter plots of the generated data, one for the train set and one for the validation set.
4. Gradient Descent
- 4-1) Random Initialization
To train a model, you need to randomly initialize the parameters/weights (in this example, we have only two: b and w).
# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)
print(b, w)
Output:
[0.49671415] [-0.1382643]
- 4-2) Compute Model’s Predictions
This is the forward pass; it simply computes the model's predictions using the current values of the parameters/weights. At the very beginning, the predictions will be really bad, as we started with random values.
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train
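To see just how bad these first predictions are, one option (not part of the original recipe, purely illustrative) is to plot the initial random line over the training data:
# Optional: plot the initial (random) line over the training data
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x_train, y_train, label='training data')
ax.plot(x_train, yhat, c='k', label='initial model')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()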
- 4-3) Compute the Loss
For a regression problem, the loss is given by the mean squared error (MSE); that is, the average of all squared errors, the squared differences between labels (y) and predictions (b + wx).
In the code below, we are using all 80 data points of the training set to compute the loss (n = 80), meaning we are performing batch gradient descent.
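Written out, this is the quantity the code below computes (the sum runs over the n points used to compute the loss; here, the full training set):
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{y}_i - y_i\bigr)^2
             = \frac{1}{n}\sum_{i=1}^{n}\bigl(b + w x_i - y_i\bigr)^2
\]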
# Step 2 - Computing the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()
print(loss)
Output:
2.720278897826747
- 4-4) Compute the Gradients
A gradient is a partial derivative. A derivative tells you how much a given quantity changes when you slightly vary some other quantity; it is partial because we compute it with respect to a single parameter while holding the others fixed.
Gradient = how much the loss changes if ONE parameter changes a little bit
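Written out, these are the two partial derivatives that Step 3 below implements, each one differentiating the MSE with respect to a single parameter:
\[
\frac{\partial\,\mathrm{MSE}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\bigl(b + w x_i - y_i\bigr), \qquad
\frac{\partial\,\mathrm{MSE}}{\partial w} = \frac{2}{n}\sum_{i=1}^{n} x_i\bigl(b + w x_i - y_i\bigr)
\]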
# Step 3 - Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)
Output:
-3.044811379650508 -1.8337537171510832
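As a sanity check (my own addition, not part of the original recipe), we can approximate the same gradients numerically by nudging one parameter at a time and measuring how much the loss moves; the helper name mse and the step size eps are arbitrary choices:
# Numerical check: nudge ONE parameter a little and measure the change in loss
def mse(b_, w_):
    return ((b_ + w_ * x_train - y_train) ** 2).mean()

eps = 1e-6
b_grad_approx = (mse(b + eps, w) - mse(b - eps, w)) / (2 * eps)
w_grad_approx = (mse(b, w + eps) - mse(b, w - eps)) / (2 * eps)
print(b_grad_approx, w_grad_approx)  # should closely match b_grad and w_grad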
- 4-5) Update the Parameters
In the final step, we use the gradients to update the parameters.
Since we are trying to minimize our losses, we reverse the sign of the gradient for the update.
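In symbols, with the Greek letter eta (η) as the learning rate, the update performed in the code below is:
\[
b \leftarrow b - \eta\,\frac{\partial\,\mathrm{MSE}}{\partial b}, \qquad
w \leftarrow w - \eta\,\frac{\partial\,\mathrm{MSE}}{\partial w}
\]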
# Sets learning rate - this is "eta", the Greek letter that looks like an "n"
lr = 0.1
print(b, w)
# Step 4 - Updates parameters using gradients and
# the learning rate
b = b - lr * b_grad
w = w - lr * w_grad
print(b, w)
Output:
[0.49671415] [-0.1382643]
[0.80119529] [0.04511107]
Note: we are using a value of 0.1 for the learning rate, which is a relatively high value, as far as learning rates are concerned!
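A quick optional check (again, my own addition): recomputing the loss with the updated parameters should give a value below the initial 2.7202...
# Loss with the updated "b" and "w" - it should be lower than before
yhat = b + w * x_train
error = (yhat - y_train)
print((error ** 2).mean())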
- 4-6) Rinse and Repeat
We use the updated parameters to go back to Step 1 and restart the process: compute predictions, the loss, and the gradients, then update the parameters again, for as many epochs as needed; see the loop sketched below.
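Putting it all together, the four steps simply go inside a loop over epochs. A minimal sketch, assuming we train for 1,000 epochs with the same learning rate (both choices are illustrative, not prescribed by the recipe above):
# Full gradient descent loop - Steps 1 to 4 repeated for n_epochs
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)

lr = 0.1
n_epochs = 1000

for epoch in range(n_epochs):
    # Step 1 - forward pass
    yhat = b + w * x_train
    # Step 2 - loss (MSE)
    error = (yhat - y_train)
    loss = (error ** 2).mean()
    # Step 3 - gradients
    b_grad = 2 * error.mean()
    w_grad = 2 * (x_train * error).mean()
    # Step 4 - update parameters
    b = b - lr * b_grad
    w = w - lr * w_grad

print(b, w)  # should end up close to true_b = 1 and true_w = 2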