PyTorch - how it is designed and why

tutorial explanation ai ml python pytorch

what’s pytorch

Pytorch is a pretty intuitive tensor library that can be used for creating neural networks. The framework packs in many features, and there are core ideas that should be understood before one can use the library effectively.

The original tutorial by pytorch provides a very good introduction that guides users through the different concepts, explaining the abstractions used in the framework. It is a pretty involved read, though, and assumes some knowledge of neural networks before everything on the page makes sense. So here, we will be filling in some of those gaps.

The snippets of code are all taken from the tutorial page. Descriptions and additional information are added to aid the understanding of the whole “using tensors to build neural networks” concept.

Each section of the tutorial will be accompanied by code, and ideally, by the end of it, we should have a rough idea of how the pytorch library was implemented, and should be able to build a library like pytorch in a similar manner ourselves.

There will be little notes littered throughout the post; these provide some explanation of the neural-network-related ideas and might be a good read for complete understanding.

numpy: neural networks as matrices

❗ this section has no pytorch yet, but it is the most crucial step. make sure you understand every single sentence here before moving on; it will make your life so much easier, trust me on this 👍

For the whole of the tutorial, we will be using this simple network as an example, figuring out which parts pytorch has abstracted away and, along the way, learning to use the different pytorch modules.

This network consists of a total of 2 layers, including the output layer: input -> hidden -> output. The input layer consists of 1000 values, the hidden layer has 100 and the output has 10.

We assume that the layers are connected to each other linearly, i.e. we take a linear sum of all the values of the “in” vector to form each value of the “out” vector. This operation is represented by the “->” that links the different layers together.

Intuitively, this operation can be represented by a matrix of weights, which we will be tweaking. And in python, the most natural way to do matrix operations is to make use of numpy.

# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

key idea: backpropagation

In the bottom section of the code snippet, we repeatedly update the weight values so that, ideally, we end up with weights that generate outputs closer to the “actual” values.

To do this, neural networks employ backpropagation. Each operation can be thought of as a matrix multiplication, and we can track how much a single value within the weight matrix affects the final value.

Intuitively, if we were to slowly nudge the values towards the optimum, it would make sense to “shift” those values that affect the final output more than those that barely affect it at all. But what would be a good measure of this “how much it affects” metric?

The gradient can be thought of as a measure of how much one variable affects another, so it serves as a good estimate of how much to nudge our current values toward the optimum. And that gradient finding is all done in the bottom section of the code.

It can be a little confusing how we got all those gradient formulas, so let me guide you through it. First we go in the forward direction and figure out how we got our final predicted vector.

disclaimer: the symbols here are used loosely, just to provide an intuition for how the gradients are found, so most of them are written the same way as they appear in the original code.

The two sets of values that we are trying to figure out here are the gradients of the two weight matrices, w_1 and w_2. We do that by working backwards, starting with the result of the loss function, L.
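Since the formulas themselves are easy to lose track of, here they are written out, loosely, and reconstructed directly from the code above, with each line labelled by the variable that computes it:

```latex
\begin{aligned}
L &= \textstyle\sum (y_{pred} - y)^2 \\
\partial L/\partial y_{pred} &= 2\,(y_{pred} - y) && \text{(grad\_y\_pred)} \\
\partial L/\partial w_2 &= h_{relu}^{T} \cdot \partial L/\partial y_{pred} && \text{(grad\_w2)} \\
\partial L/\partial h_{relu} &= \partial L/\partial y_{pred} \cdot w_2^{T} && \text{(grad\_h\_relu)} \\
\partial L/\partial h &= \partial L/\partial h_{relu} \odot \mathbf{1}[h \ge 0] && \text{(grad\_h)} \\
\partial L/\partial w_1 &= x^{T} \cdot \partial L/\partial h && \text{(grad\_w1)}
\end{aligned}
```

Each line is just the chain rule applied one step further back: take the gradient we already have and multiply it by the local derivative of the operation we are undoing.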

📌 Go through the formulas and really try to match them to the code; this will help you understand why the code is written this way. For more details, you can view this tutorial on backpropagation

tensors: upgrading from cpu

The example code here is not much different from the previous one. Why? Because for a start, we are just reimplementing the numpy arrays. The reason for doing this is so that we can make use of different processors to run our calculations.

So imagine you want to write your own neural network library: the first thing you would want to do is make sure it can run on a GPU and enjoy all those speedy add-ons. The best way to do that is to rewrite how the arrays are represented and stored.

This means rewriting all the matrix functions on your own, and that is what pytorch did: the result is tensors.
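If you want convincing that tensors really are just numpy arrays with extra powers, pytorch even lets you convert between the two (torch.from_numpy and .numpy() are real pytorch API; the shapes here are made up purely for illustration):

```python
import numpy as np
import torch

# a numpy array and a tensor built from it (they share the same memory)
a = np.random.randn(3, 4)
t = torch.from_numpy(a)

# same matrix multiply, different method names: numpy's .dot() vs torch's .mm()
w = np.random.randn(4, 2)
out_np = a.dot(w)
out_torch = t.mm(torch.from_numpy(w))

# both give the same numbers, up to floating point precision
print(np.allclose(out_np, out_torch.numpy()))  # True
```

The point of the rewrite is not this cpu version, but that the exact same tensor code can run on a GPU just by moving the tensors there.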

# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

autograd: getting rid of that pesky gradient calculation

If you look at all the code we have so far and the explanation we went over, the most tedious part of all is the backpropagation. And intuitively, this is also the part that is the easiest to automate.

Every function has its own “reverse function” that can be used to automatically calculate the gradient. So instead of doing the math by hand every single time, we can track all the functions that a certain tensor was exposed to, and then “reverse” those functions to get the gradient.

And pytorch has that all implemented for us, and they call it autograd.

# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a Tensor of shape (1,)
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

As you can see, the key here is to create the tensor with requires_grad=True. This tells the tensor to keep track of all those functions that we need to “reverse” later on, when we call the backward() function.
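A tiny, self-contained example (with the values picked arbitrarily) shows the whole mechanism in three lines:

```python
import torch

# track every operation applied to x
x = torch.tensor(3.0, requires_grad=True)

y = x ** 2     # pytorch records "squared" as a function to reverse later
y.backward()   # walk the recorded operations in reverse

# dy/dx = 2x, so at x = 3 the gradient is 6
print(x.grad)  # tensor(6.)
```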

How do you think the backward() function works then? Intuitively, you can imagine that the loss tensor is chained to, and has pointers to, all the different functions and values that were used in its calculation.

When the function is called, the backward() call walks through each function and each tensor in reverse, updating the gradients of the different tensors as it goes along.

That is why it is important to call w1.grad.zero_(): gradients accumulate by default, so everything must be reset to ensure we get the right value the next time backward() traces back and updates the gradients.
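You can watch the accumulation happen with the same tiny squaring example: calling backward() twice without zeroing in between adds the gradients together instead of replacing them:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()
print(x.grad)        # tensor(6.) -- the correct gradient, 2 * 3

(x ** 2).backward()  # backward again WITHOUT zeroing first
print(x.grad)        # tensor(12.) -- gradients accumulated: 6 + 6

x.grad.zero_()       # reset, just like w1.grad.zero_() in the loop above
print(x.grad)        # tensor(0.)
```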

autograd: writing your own “reverse” function

Say you have created this amazing tensor library and a framework that does all the backpropagation for you. The last thing you would want is to take on the job of writing every possible function and its “reverse” equivalent yourself.

That is why the pytorch library exposes the Function class for us to write our own functions. Here, we will be writing an activation function (relu), which cuts off all values below 0.

Intuitively, what we are doing here is: during the forward phase, we store all the values that we will need later for the “reverse” calculation. Then, when the framework traces backwards through our function, we do the right calculations to return the correct values: the gradient.

# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

key ideas: activation function

So activation functions are non-linear functions applied between the layers; in relu’s case, its effect is to cull values. Basically, we want to be able to take out values that are not important and just throw them away (set them to 0).

There are many types of activation functions, and here we are only using the relu function, which is a complicated way of saying “clamp anything below 0 to 0”. Of course there are many other classes of functions, each with its own uses, and it would be crazy to delve into all of them here.

So if you are interested, you can head over here to learn more about them.
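Still, just to give a feel for the shapes of a few common ones, here is relu next to sigmoid and tanh (all three are built into torch; the input values are arbitrary):

```python
import torch

v = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(v))     # negatives clamped to 0, positives untouched
print(torch.sigmoid(v))  # everything squashed into (0, 1)
print(torch.tanh(v))     # everything squashed into (-1, 1)
```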

nn: more abstractions, MOARRRRRRR

So the thing is, programmers are lazy people. We want to reduce the amount of code we type, and for that, we are willing to type more code once: create more modules, create higher-level abstractions.

At this level, we are trying to avoid having to do the matrix multiplication ourselves. How do we do that? We wrap everything up into torch.nn.Linear(). And then how do we link these different “layers” together?

We have yet another layer that wraps everything together: torch.nn.Sequential. Basically, it tells the framework to “do one layer, pass the result to the next”, and so on.

Of course, loss function calculations are abstracted away at this level too. No more doing the matrix calculations on our own. Here, we will use MSELoss (mean squared error), which is very close to what we had in the previous examples.

Since everything is abstracted away here, to handle the backpropagation part we introduce a few new functions, like zero_grad() and parameters(), which are used to update our weight tensors.
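Before the full example, a quick sanity check that MSELoss with reduction='sum' really gives the same number as the manual (y_pred - y).pow(2).sum() from the earlier snippets (shapes chosen to match the running example):

```python
import torch

y_pred = torch.randn(64, 10)
y = torch.randn(64, 10)

manual = (y_pred - y).pow(2).sum()
loss_fn = torch.nn.MSELoss(reduction='sum')

print(torch.allclose(manual, loss_fn(y_pred, y)))  # True
```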

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

📌 At this point, these are the different levels of abstraction we went through

  • numpy: how neural networks can be represented by matrix multiplication
  • torch.tensor: making matrix multiplication faster
  • torch.autograd: abstracting away backpropagation
  • torch.autograd.Function: writing your own reverse function
  • torch.nn: abstracting away all the tensor operations

optim: add parameters and controls to backpropagation

At this point, we have a framework that already supports quite a lot of matrix operations. Here, we will shift our focus to the backpropagation aspect of neural networks. Since the weight updates are all just math, we can modify them so that the weights are updated from the gradients in any way we want.

There are many ways to update the values in a neural network. One of the most common methods is stochastic gradient descent (used in a later section), where we can control how much to update the values by at each round.

In the example here, the Adam optimizer is used, and its job is to update the values in the network. All of this is hidden behind the step() function.

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers( i.e, not overwritten) whenever .backward()
    # is called. Checkout docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
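To demystify step() a little: for plain torch.optim.SGD without momentum, step() boils down to the same manual update we wrote in the autograd section (Adam does more bookkeeping, keeping running statistics per parameter, but the shape of the update is the same). A rough sketch of the idea on a single scalar “parameter”:

```python
import torch

w = torch.tensor(5.0, requires_grad=True)
lr = 0.1

(w ** 2).backward()       # gradient of w^2 is 2w = 10 at w = 5

# roughly what SGD's step() does for each parameter:
with torch.no_grad():
    w -= lr * w.grad      # 5 - 0.1 * 10 = 4
w.grad.zero_()            # and what zero_grad() does

print(w.item())           # 4.0
```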

module: gluing different layers, with different glue

At this point, the framework is pretty complete. But the people at pytorch asked themselves: can we push it further? Can we fine-tune how the layers interact with each other? What if we don’t want to just apply them sequentially?

Here, with torch.nn.Module, we can create all the layers we need, and then glue them together however we want using the forward() function. Instead of using nn.Sequential as the glue, we manually apply the different layers.

# -*- coding: utf-8 -*-
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

After this layer of abstraction, you can see that the code suddenly becomes very intuitive, and that everything is split into 3 different aspects, which is exactly what neural networks are composed of.

  1. the model (TwoLayerNet)
  2. the loss function (MSELoss)
  3. backpropagation (optim.SGD)

All of the abstraction done 👍

module: example of a different type of glue

We mentioned in the earlier section that Module allows you to connect the layers in different ways. But what is the use of this? In what cases would we need to glue the layers together differently?

Here, we have an example where, instead of having just one hidden layer, you can choose to go through that layer a random number of times. I am not saying that this kind of neural network will be useful, but it certainly does make it hard for the typical “glue” to connect the layers together.

# -*- coding: utf-8 -*-
import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Here, we can see that it is not hard to figure out what the model is trying to do. The first linear layer is applied to the input, then the middle layer is applied a random number of times, then the output layer is applied once.

See, when you have abstraction at this level, everything becomes very easy to understand. And if you look at the lines that do the work, the loop is just doing the same 3 things again:

  1. model(x): run the model on the input
  2. loss.backward(): trace the loss backwards to calculate the gradients
  3. optimizer.step(): update the values using whatever method you chose

some final words

Most of the stuff you should have learnt is in the content above, so I am not going to repeat it. But in this section, I am gonna give a rough idea of why I decided to do up this post.

Most of the time, when people learn code-related things, they just search for code snippets, paste them in, and see if they work. For understanding concepts, however, it is important to understand why the library is implemented in a certain way.

This allows us to really utilize the framework as it was designed. And having a complete overview of the whole framework makes understanding code a lot easier as well.

By understanding which parts are abstracted away and where each one lies, we can efficiently make any changes we need. Say you want to add a new layer, or change the way the loss function works: all of these can be easily done if you know where to look.

Good luck, have fun 👍