PyTorch Tutorial for Beginners: Build a Multiple Regression Model from Scratch

Before LLMs took off, there was an almost visible line separating machine learning frameworks from deep learning frameworks.

The talk was centered on Scikit-Learn, XGBoost, and similar libraries for ML, while PyTorch and TensorFlow dominated the scene when the subject was deep learning.

After the AI explosion, I have been seeing PyTorch dominating the space over TensorFlow. Both frameworks are very powerful, enabling data scientists to solve many kinds of problems, natural language processing being one of them, which has increased the popularity of deep learning.

In this post, my idea is not to talk about NLP. Instead, I will work through a multivariable linear regression problem with two goals in mind:

To teach you how to build a model using PyTorch.
To share knowledge about linear regression that is not always found in other tutorials.

Let's dive in.

Preparing the data

OK, let me spare you a fancy definition of linear regression. You have probably seen it over and over in countless tutorials all over the Internet. So, it is enough to say that when you have a variable y that you want to predict and another variable X that can explain the variation of y using a straight line, that is, in essence, linear regression.

Dataset

In this exercise, let's use the Abalone dataset [1].

[1] Nash, W., Sellers, T., Talbot, S., Cawthorn, A., & Ford, W. (1994). Abalone [Dataset]. UCI Machine Learning Repository.

According to the dataset documentation, the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a tedious and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

So, let's go ahead and load the data. In addition, we will one-hot encode the variable Sex, since it is the only categorical one.

# Data Load
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from feature_engine.encoding import OneHotEncoder

# fetch dataset
abalone = fetch_ucirepo(id=1)

# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets

# One Hot Encode Sex
ohe = OneHotEncoder(variables=['Sex'])
X = ohe.fit_transform(X)

# View
df = pd.concat([X,y], axis=1)

Here is the data.

Dataset header. Image by the author.

So, to build a better model, let's explore the data.

Exploring the data

The first steps I like to take when exploring a dataset are:

1. Checking the distribution of the target variable.

# Looking at our Target variable
plt.hist(y)
plt.title('Rings [Target Variable] Distribution');

The plot shows that the target variable is not normally distributed. That can affect the regression, but it can usually be corrected with a power transformation, such as log or Box-Cox.

Target variable distribution. Image by the author.
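
To make the effect of a log transformation concrete, here is a minimal sketch on synthetic data. The sample is a made-up right-skewed variable, not the abalone target, and `skewness` is a small helper written for this illustration.

```python
import numpy as np

# Synthetic, right-skewed stand-in for a count-like target (NOT the abalone data)
rng = np.random.default_rng(42)
target = rng.lognormal(mean=2.2, sigma=0.35, size=4000)

def skewness(values):
    """Sample skewness: mean of the cubed z-scores (close to 0 for a symmetric sample)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

skew_raw = skewness(target)
skew_log = skewness(np.log(target))

print(f"Skewness before log: {skew_raw:.3f}")
print(f"Skewness after log:  {skew_log:.3f}")
```

A skewness close to zero after the transformation suggests a more symmetric target. On real data the improvement is usually less clean than in this constructed example.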

2. Looking at the statistical description.

The statistics show us important information such as the mean and standard deviation, and they make it easy to spot discrepancies in the minimum or maximum values. The explanatory variables are fine: they sit within a small range and on the same scale. The target variable (Rings) is on a different scale.

# Statistical description
df.describe()

Statistical description. Image by the author.

Next, let's check the correlations.

# Looking at the correlations
(df
 .drop(['Sex_M', 'Sex_I', 'Sex_F'],axis=1)
 .corr()
 .style
 .background_gradient(cmap='coolwarm')
)

The explanatory variables have a moderate to strong correlation with Rings. We can also see that there is some collinearity between Whole_weight and Shucked_weight, Viscera_weight, and Shell_weight. Length and Diameter are also collinear. We can test removing them later.

sns.pairplot(df);

When we plot the pairwise scatterplots and look at the relationship of the variables with Rings, we can quickly identify some problems:

The assumption of homoscedasticity is violated. This means that the variance is not constant across the relationship.
Notice how the plots form a cone shape, with the variance of y increasing as the values of X increase. When estimating the value of Rings for higher values of the X variables, the estimate will not be very accurate.
The variable Height has at least two very visible outliers, where Height > 0.3.

Pairplots with no transformation. Image by the author.
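
The cone shape described above can be reproduced with a small simulation. This is only a sketch on synthetic data (not the abalone measurements): the noise is built to grow with x, a straight line is fitted, and the spread of the residuals is compared between the lower and upper halves of x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (NOT the abalone data): the noise grows with x, as in a cone-shaped plot
x = rng.uniform(0.1, 1.0, size=2000)
y = 3.0 + 10.0 * x + rng.normal(0.0, 1.0, size=2000) * (2.0 * x)

# Fit a straight line by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare the residual spread in the lower and upper halves of x
cutoff = np.median(x)
spread_low = residuals[x <= cutoff].std()
spread_high = residuals[x > cutoff].std()

print(f"Residual std for low x:  {spread_low:.3f}")
print(f"Residual std for high x: {spread_high:.3f}")
```

A clearly larger residual spread for high x is the numeric counterpart of the cone in the scatterplot, and it is why predictions for larger values tend to be less reliable.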

Removing the outliers and transforming the target variable to its logarithm results in the next pairplot. It is better, but it still doesn't solve the homoscedasticity problem.

Pairplot after transformation. Image by the author.

Another quick exploration we can do is to plot some graphics to check the relationships of the variables when grouped by the Sex variable.

The variable Diameter has its most linear relationship when Sex=I, but that's all.

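# Create a FacetGrid with scatterplots
# Use the raw features here: in df, Sex was one-hot encoded, so the column no longer exists
df_raw = pd.concat([abalone.data.features, abalone.data.targets], axis=1)
sns.lmplot(x="Diameter", y="Rings", hue="Sex", col="Sex", order=2, data=df_raw);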

On the other hand, Shell_weight has too much dispersion for higher values, which distorts the linear relationship.

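# Create a FacetGrid with scatterplots
# Use the raw features here: in df, Sex was one-hot encoded, so the column no longer exists
df_raw = pd.concat([abalone.data.features, abalone.data.targets], axis=1)
sns.lmplot(x="Shell_weight", y="Rings", hue="Sex", col="Sex", data=df_raw);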

Shell_weight x Rings. Image by the author.

All of this shows that a linear regression model would be a big challenge for this dataset, and it will probably fail. But we still want to do it.

By the way, I don't remember seeing a post where we actually walk through something that went wrong. So, by doing this, we can also learn valuable lessons.

Modeling: Using Scikit-Learn

Let's run the sklearn model and evaluate it using the Root Mean Squared Error (RMSE).

from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

df2 = df.query('Height < 0.3 and Rings > 2 ').copy()
X = df2.drop(['Rings'], axis=1)
y = np.log(df2['Rings'])

lr = LinearRegression()
lr.fit(X, y)

predictions = lr.predict(X)

df2['Predictions'] = np.exp(predictions)
print(root_mean_squared_error(df2['Rings'], df2['Predictions']))

2.2383762717104916

If we look at the header, we can confirm that the model struggles with estimates for higher values (e.g., rows 0, 6, 7, and 9).

Header with predictions. Image by the author.

One step back: Trying other transformations

Alright. So what can we do now?

Maybe remove more outliers and try again. Let's use an unsupervised algorithm to find some more outliers. We will apply the Local Outlier Factor, dropping 5% of the observations as outliers.

We will also reduce multicollinearity by dropping Whole_weight and Length.

from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# fetch dataset
abalone = fetch_ucirepo(id=1)

# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets

# One Hot Encode Sex
ohe = OneHotEncoder(variables=['Sex'])
X = ohe.fit_transform(X)

# Drop Whole Weight and Length (multicolinearity)
X.drop(['Whole_weight', 'Length'], axis=1, inplace=True)

# View
df = pd.concat([X,y], axis=1)

# Let's create a Pipeline to scale the data and find outliers using Local Outlier Factor (a nearest-neighbors method)
steps = [
('scale', StandardScaler()),
('LOF', LocalOutlierFactor(contamination=0.05))
]
# Fit and predict
outliers = Pipeline(steps).fit_predict(X)

# Add column
df['outliers'] = outliers

# Modeling
df2 = df.query('Height < 0.3 and Rings > 2 and outliers != -1').copy()
X = df2.drop(['Rings', 'outliers'], axis=1)
y = np.log(df2['Rings'])

lr = LinearRegression()
lr.fit(X, y)

predictions = lr.predict(X)

df2['Predictions'] = np.exp(predictions)
print(root_mean_squared_error(df2['Rings'], df2['Predictions']))

2.238174395913869

Same result. Hmm….

OK. We can keep playing with the variables and feature engineering, and we will start seeing some improvements here and there, like when we add the squares of Height, Diameter, and Shell_weight. That, added to the outlier treatment, will drop the RMSE to 2.196.

# Second Order Variables
X['Diameter_2'] = X['Diameter'] ** 2
X['Height_2'] = X['Height'] ** 2
X['Shell_2'] = X['Shell_weight'] ** 2

It is fair to note that every variable added to a linear regression model will affect the R² and sometimes inflate it, giving the false impression that the model is improving when it is not. In this case, the model actually improves, because the second-order variables add some non-linear components to it. We can check that by calculating the adjusted R². It went from 0.495 to 0.517.

# Adjusted R²
from sklearn.metrics import r2_score

r2 = r2_score(df2['Rings'], df2['Predictions'])
n = df2.shape[0]  # number of observations
p = X.shape[1]    # number of predictors (df2 also holds Rings, outliers, and Predictions)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'R²: {r2}')
print(f'Adjusted R²: {adj_r2}')
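
The point about R² inflating as variables are added can be illustrated on synthetic data. In this sketch (not the abalone data), y depends on a single feature, and five pure-noise columns are then added. `r2_and_adjusted` is a helper written for this example; it fits ordinary least squares with NumPy and applies the same adjusted R² formula as above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic data (NOT the abalone data): y depends on x1 only
x1 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)
noise_features = rng.normal(size=(n, 5))  # unrelated to y by construction

def r2_and_adjusted(features, target):
    """Fit OLS with an intercept and return (R², adjusted R²)."""
    n_obs, p = features.shape
    design = np.column_stack([np.ones(n_obs), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    ss_res = np.sum((target - design @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
    return r2, adj

r2_base, adj_base = r2_and_adjusted(x1.reshape(-1, 1), y)
r2_noise, adj_noise = r2_and_adjusted(np.column_stack([x1, noise_features]), y)

print(f"1 feature : R²={r2_base:.4f}, adjusted R²={adj_base:.4f}")
print(f"6 features: R²={r2_noise:.4f}, adjusted R²={adj_noise:.4f}")
```

In-sample R² cannot go down when columns are added, even useless ones, while the gap between R² and adjusted R² grows with the number of predictors. That gap is the penalty that makes adjusted R² the more honest comparison.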

On the other hand, bringing back Whole_weight and Length can improve the numbers a little more, but I would not recommend it. If we do that, we add multicollinearity and make the coefficients of some variables unreliable, which can lead to estimation errors in the future.
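
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is an illustration only: the three columns are synthetic and merely borrow names from the dataset, and `variance_inflation_factors` is a helper written here with NumPy rather than a library call.

```python
import numpy as np

def variance_inflation_factors(features):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the others."""
    n_obs, n_cols = features.shape
    vifs = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_obs), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        ss_res = np.sum((target - design @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        vifs.append(ss_tot / ss_res)  # algebraically the same as 1 / (1 - R²)
    return np.array(vifs)

# Synthetic columns (NOT the abalone data); the names are only for readability
rng = np.random.default_rng(7)
length = rng.uniform(0.1, 0.8, size=500)
diameter = 0.8 * length + rng.normal(0, 0.01, size=500)  # almost a rescaled copy of length
height = rng.uniform(0.05, 0.25, size=500)               # generated independently

vifs = variance_inflation_factors(np.column_stack([length, diameter, height]))
for name, vif in zip(["length", "diameter", "height"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

A frequently quoted rule of thumb treats a VIF above 5 or 10 as a sign that a variable is largely explained by the others, which is the situation that makes dropping one of a collinear pair reasonable.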

Modeling: Using PyTorch

OK. Now that we have a baseline model, the idea is to create a linear model using deep learning and try to beat the RMSE of 2.196.

Right. To start, let me state this upfront: deep learning models work better with scaled data. However, since our X variables are all on the same scale, we don't need to worry about that. So let's keep going.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

We need to prepare the data for modeling with PyTorch. Here, we need some adjustments to make the data acceptable to the PyTorch framework, since it won't take regular pandas dataframes.

Let's use the same dataframe from our baseline model.
Split X and y.
Transform the y variable to log.
Convert both to NumPy arrays, since PyTorch won't take dataframes.

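df2 = df.query('Height < 0.3 and Rings > 2 and outliers != -1').copy()
X = df2.drop(['Rings', 'outliers'], axis=1)
y = np.log(df2[['Rings']])

# Second-order variables, as in the baseline model
# (they appear as the last three values in the input sample printed below)
X['Diameter_2'] = X['Diameter'] ** 2
X['Height_2'] = X['Height'] ** 2
X['Shell_2'] = X['Shell_weight'] ** 2

# X and y to NumPy
X = X.to_numpy()
y = y.to_numpy()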

Next, using TensorDataset, we convert X and y into tensor objects and print the result.

# Prepare with TensorDataset
# TensorDataset wraps the feature and target tensors so they can be indexed together
dataset = TensorDataset(torch.tensor(X).float(), torch.tensor(y).float())

input_sample, label_sample = dataset[0]
print(f'** Input sample: {input_sample}, \n** Label sample: {label_sample}')

** Input sample: tensor([0.3650, 0.0950, 0.2245, 0.1010, 0.1500, 1.0000, 
0.0000, 0.0000, 0.1332, 0.0090, 0.0225]), 
** Label sample: tensor([2.7081])

Then, using the DataLoader class, we can create batches of data. This means that the neural network will deal with batch_size observations at a time.

# Next, let's use DataLoader
batch_size = 500
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
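
To see what the DataLoader yields, here is a small sketch with a synthetic dataset. The 1,200 rows and 11 columns are arbitrary stand-ins, not the abalone data, and the `demo_` names are used only in this example.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 1,200 rows with 11 features and one target column
rng = np.random.default_rng(0)
X_demo = rng.random((1200, 11))
y_demo = rng.random((1200, 1))

demo_dataset = TensorDataset(torch.tensor(X_demo).float(), torch.tensor(y_demo).float())
demo_loader = DataLoader(demo_dataset, batch_size=500, shuffle=True)

# Each iteration returns one (features, target) pair holding up to batch_size rows
batch_shapes = [(features.shape, target.shape) for features, target in demo_loader]
for features_shape, target_shape in batch_shapes:
    print(features_shape, target_shape)
```

With 1,200 rows and batch_size=500, the loader produces two full batches and a final smaller one. Because shuffle=True, the rows that land in each batch change on every pass over the data.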

PyTorch models are best defined as classes.

The class is based on nn.Module, which is PyTorch's base class for neural networks.
We define the model layers we want to use in the __init__ method.
super().__init__() ensures that the class will behave like a torch object.
The forward method describes what happens to the input when it is passed to the model.

Here, we pass it through the linear layers that we defined in the __init__ method, and we use the ReLU activation function to add some non-linearity to the model in the forward pass.

# 2. Creating a class
class AbaloneModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear1 = nn.Linear(in_features=X.shape[1], out_features=128)
    self.linear2 = nn.Linear(128, 64)
    self.linear3 = nn.Linear(64, 32)
    self.linear4 = nn.Linear(32, 1)

  def forward(self, x):
    x = self.linear1(x)
    x = nn.functional.relu(x)
    x = self.linear2(x)
    x = nn.functional.relu(x)
    x = self.linear3(x)
    x = nn.functional.relu(x)
    x = self.linear4(x)
    return x

# Instantiate model
model = AbaloneModel()
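
A quick sanity check for a freshly defined model is to pass a dummy batch through it and count its parameters. The sketch below rebuilds the same layer sizes in a standalone class so it runs on its own; `DemoRegressor` and its `n_features` argument are names made up for this example, and 11 input features is an assumption taken from the input sample printed earlier.

```python
import torch
import torch.nn as nn

class DemoRegressor(nn.Module):
    """Same layer sizes as the model above; n_features is a hypothetical argument."""
    def __init__(self, n_features):
        super().__init__()
        self.linear1 = nn.Linear(n_features, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 32)
        self.linear4 = nn.Linear(32, 1)

    def forward(self, x):
        x = nn.functional.relu(self.linear1(x))
        x = nn.functional.relu(self.linear2(x))
        x = nn.functional.relu(self.linear3(x))
        return self.linear4(x)

demo_model = DemoRegressor(n_features=11)
dummy_batch = torch.rand(5, 11)  # 5 rows of random inputs
output = demo_model(dummy_batch)

# Every nn.Linear layer holds in_features * out_features weights plus out_features biases
n_params = sum(p.numel() for p in demo_model.parameters())

print(output.shape)  # one prediction per input row
print(n_params)
```

If the output shape is not (rows, 1), the mismatch will surface later as a confusing error or a silent broadcast in the loss function, so it is cheaper to check it here.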

Next, let's try the model for the first time using a script that simulates a random search.

Create an error criterion to evaluate the model.
Create a list to hold the data from the best model and set best_loss to a high value, so it will be replaced by better loss numbers during the iteration.
Set the range for the learning rate. We will use negative powers of 10 with exponents from 2 to 4 (e.g., from 0.01 to 0.0001).
Set the range for the momentum from 0.9 to 0.99.
Get the data.
Zero the gradients to clear the gradient calculations from the previous iteration.
Fit the model.
Compute the loss and register the best model's numbers.
Compute the gradients of the weights and biases with the backward pass.
Iterate N times and print the best model.

# Mean Squared Error (MSE) is standard for regression
criterion = nn.MSELoss()

# Random Search
values = []
best_loss = 999
for idx in range(1000):
  # Randomly sample a learning rate factor between 2 and 4
  factor = np.random.uniform(2, 4)
  lr = 10 ** -factor

  # Randomly select a momentum between 0.90 and 0.99
  momentum = np.random.uniform(0.90, 0.99)

  # 1. Get Data
  feature, target = dataset[:]
  # 2. Zero Gradients: Clear old gradients before the backward pass
  optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
  optimizer.zero_grad()
  # 3. Forward Pass: Compute prediction
  y_pred = model(feature)
  # 4. Compute Loss
  loss = criterion(y_pred, target)
  # 4.1 Register best loss (as a plain float, detached from the graph)
  if loss.item() < best_loss:
    best_loss = loss.item()
    best_lr = lr
    best_momentum = momentum
    best_idx = idx

  # 5. Backward Pass: Compute gradient of the loss w.r.t W and b
  loss.backward()
  # 6. Update Parameters: Adjust W and b using the calculated gradients
  optimizer.step()
  values.append([idx, lr, momentum, loss.item()])

# Print the best combination found
print(f'n: {best_idx},lr: {best_lr}, momentum: {best_momentum}, loss: {best_loss}')

n: 999,lr: 0.004782946959508322, momentum: 0.9801209929050066, loss: 0.06135804206132889
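
The loop above keeps updating the same model while the learning rate and momentum change, so later trials start from better weights than earlier ones. A possible variation, shown here only as a sketch and not as the script used in this post, resets the model before each trial so that every candidate is compared from the same starting point. The data is synthetic, and `make_model`, the network size, and the budget of 50 steps per trial are all assumptions made for the illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic regression data (NOT the abalone data)
X_demo = torch.rand(400, 4)
true_weights = torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
y_demo = X_demo @ true_weights + 0.05 * torch.randn(400, 1)

def make_model():
    """Hypothetical helper: build a small network for one trial."""
    return nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

criterion = nn.MSELoss()
trials = []
for trial in range(20):
    lr = float(10 ** -rng.uniform(2, 4))      # learning rate between 0.01 and 0.0001
    momentum = float(rng.uniform(0.90, 0.99))

    torch.manual_seed(0)                      # same initial weights for every candidate
    model_trial = make_model()
    optimizer = optim.SGD(model_trial.parameters(), lr=lr, momentum=momentum)

    for _ in range(50):                       # short, fixed training budget per trial
        optimizer.zero_grad()
        loss = criterion(model_trial(X_demo), y_demo)
        loss.backward()
        optimizer.step()

    trials.append((loss.item(), lr, momentum))

# Tuples sort by their first element, so min() picks the trial with the lowest loss
best_loss, best_lr, best_momentum = min(trials)
print(f"lr: {best_lr:.5f}, momentum: {best_momentum:.3f}, loss: {best_loss:.5f}")
```

The trade-off is cost: each trial now trains from scratch, so the number of trials and the steps per trial have to be kept small for the search to stay quick.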

Once we get a good learning rate and momentum, we can move on.

# --- 3. Loss Function and Optimizer ---

# Mean Squared Error (MSE) is standard for regression
criterion = nn.MSELoss()

# Stochastic Gradient Descent (SGD) with a small learning rate (lr)
optimizer = optim.SGD(model.parameters(), lr=0.004, momentum=0.98)

Next, we will retrain this model, using the same steps as before, but this time keeping the learning rate and momentum fixed.

Fitting a PyTorch model requires a longer script than the usual fit() method from Scikit-Learn. But it's not a big deal. The structure will always be similar to these steps:

Turn on training mode with model.train().
Create a loop for the number of iterations you want. Each iteration is called an epoch.
Zero the gradients from previous passes with optimizer.zero_grad().
Get the batches from the dataloader.
Compute the predictions with model(X).
Calculate the loss using criterion(y_pred, target).
Do the backward pass to compute the gradients of the weights and biases: loss.backward()
Update the weights and biases with optimizer.step()

We will train this model for 1000 epochs (iterations). Here, we only add a step to keep the best model at the end, so we make sure to use the model with the best loss.

# 4. Training
import copy

torch.manual_seed(42)
NUM_EPOCHS = 1001  # 1001 so that epoch 1000 is also printed
loss_history = []
best_loss = 999

# Put model in training mode
model.train()

for epoch in range(NUM_EPOCHS):
  for data in dataloader:

    # 1. Get Data
    feature, target = data

    # 2. Zero Gradients: Clear old gradients before the backward pass
    optimizer.zero_grad()

    # 3. Forward Pass: Compute prediction
    y_pred = model(feature)

    # 4. Compute Loss
    loss = criterion(y_pred, target)
    loss_history.append(loss.item())

    # Get Best Model
    if loss.item() < best_loss:
      best_loss = loss.item()
      # state_dict() returns references to the live weights, so save a copy
      best_model_state = copy.deepcopy(model.state_dict())

    # 5. Backward Pass: Compute gradient of the loss w.r.t W and b
    loss.backward()

    # 6. Update Parameters: Adjust W and b using the calculated gradients
    optimizer.step()

  # Print status every 200 epochs
  if epoch % 200 == 0:
    print(epoch, loss.item())
    print(f'Best Loss: {best_loss}')

# After training, load the best model before returning predictions
model.load_state_dict(best_model_state)

0 0.061786893755197525
Best Loss: 0.06033024191856384
200 0.036817338317632675
Best Loss: 0.03243456035852432
400 0.03307393565773964
Best Loss: 0.03077109158039093
600 0.032522525638341904
Best Loss: 0.030613820999860764
800 0.03488151729106903
Best Loss: 0.029514113441109657
1000 0.0369877889752388
Best Loss: 0.029514113441109657
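
One detail is worth checking when keeping the best weights during training: model.state_dict() returns tensors that share memory with the live parameters, so a snapshot that should survive later updates needs to be copied. The sketch below shows the difference on a tiny layer; the layer, data, and variable names exist only for this illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)
optimizer = optim.SGD(layer.parameters(), lr=0.1)

live_state = layer.state_dict()               # shares memory with the live parameters
snapshot = copy.deepcopy(layer.state_dict())  # independent copy of the current values

# One training step changes the weights in place
inputs, target = torch.rand(8, 3), torch.rand(8, 1)
loss = nn.functional.mse_loss(layer(inputs), target)
loss.backward()
optimizer.step()

live_matches = torch.equal(live_state["weight"], layer.weight)
snapshot_matches = torch.equal(snapshot["weight"], layer.weight)
print(f"Uncopied state follows the update: {live_matches}")
print(f"Deep copy follows the update:      {snapshot_matches}")
```

The uncopied dictionary tracks the updated weights, while the deep copy keeps the values from before the step. Only the copy is useful for restoring an earlier, better state.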

Nice. The model is trained. Now it is time to evaluate.

Evaluation

Let's check whether this model did better than the regular regression. For that, I will put the model in evaluation mode by using model.eval(), so PyTorch knows that it needs to change the behavior from training to inference mode. It will turn off dropout and make batch normalization use its running statistics, for example.
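
The effect of switching modes is easiest to see on a layer that behaves differently in each one. The model in this post uses only linear layers and ReLU, so its predictions are the same in both modes, but calling eval() is still a good habit. The sketch below uses a standalone dropout layer purely as an illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
ones = torch.ones(1000)

dropout.train()  # training mode: random zeros, survivors scaled by 1 / (1 - p)
train_out = dropout(ones)

dropout.eval()   # evaluation mode: dropout passes the input through unchanged
eval_out = dropout(ones)

print(train_out.unique())            # distinct values seen in training mode
print(torch.equal(eval_out, ones))   # whether eval mode left the input untouched
```

In training mode about half of the entries are zeroed and the rest are doubled, while in evaluation mode the input comes back unchanged. Forgetting model.eval() on a network with dropout would therefore add random noise to every prediction.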

# Get features
features, targets = dataset[:]

# Get Predictions
model.eval()
with torch.no_grad():
  predictions = model(features)

# Add to dataframe (flatten the (n, 1) predictions to a 1-D array first)
df2['Predictions'] = np.exp(predictions.detach().numpy().ravel())

# RMSE
print(root_mean_squared_error(df2['Rings'], df2['Predictions']))

2.1108551025390625

The improvement was modest, about 4%.

Let's look at some predictions from each model.

Predictions from both models. Image by the author.

Both models get similar results. They struggle more as the number of Rings gets higher. That is due to the cone shape of the target variable.

If we think about it for a moment:

As the number of Rings increases, there is more variance coming from the explanatory variables.
An abalone with 15 rings will fall within a much wider range of values than another one with 4 rings.
This confuses the model, because it needs to draw a single line through data that is not that linear.

Before you leave

We learned a lot from this project:

How to explore the data.
How to check whether a linear model would be a good option.
How to build a PyTorch model for multivariable linear regression.

In the end, we saw that a poorly behaved target variable, even after power transformations, can lead to a low-performing model. Our model is still better than guessing the average value for every prediction, but the error is still high, staying around 20% of the predicted value.

We tried to use deep learning to improve the result, but all that power was not enough to reduce the error much. I would probably go with the Scikit-Learn model, because it is simpler and more explainable.

Another option to try to improve the results would be to create a custom ensemble model with Random Forest + Linear Regression. But that is a task I leave to you, if you want.

If you liked this content, you can find more on my website.