{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Neural Networks\n\nNeural networks can be constructed using the ``borch.nn`` package.\n\nNow that you've had a glimpse of ``autograd``, note that ``nn`` depends on ``autograd``\nto define models and differentiate them. An ``nn.Module`` contains layers\nand a method ``forward(input)`` that returns the ``output``.\n\nFor example, look at this network that classifies digit images:\n\n![convnet](/_static/img/mnist.png)\n\nIt is a simple feed-forward network. It takes an input, feeds it through\nseveral layers one after the other, and then finally gives the output.\n\nA typical training procedure for a neural network is as follows:\n\n* Define a network that has some learnable parameters and/or random variables\n* For each batch in a dataset, do:\n\n  - Process the input data through the network\n  - Compute the loss (how far is the output from being correct?)\n  - Propagate gradients back into the network\u2019s parameters\n  - Update the weights of the network, typically using a simple update rule:\n    ``weight = weight - learning_rate * gradient``\n\n## Define the network\n\nLet\u2019s define this network:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torch.nn.functional as F\nimport borch\nfrom borch import distributions, posterior, nn, infer\n\n\nclass Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__(posterior=posterior.Automatic())\n        # 1 input image channel, 6 output channels, 5x5 convolution kernel\n        self.conv1 = nn.Conv2d(1, 6, 5)\n\n        # 6 input channels, 16 output channels, 5x5 convolution kernel\n        self.conv2 = nn.Conv2d(6, 16, 5)\n\n        # An affine operation: y = Wx + b\n        # NB: each 5x5 convolution (no padding) is followed by 2x2 max pooling,\n        # so a 32x32 input goes 32 -> 28 -> 14 -> 10 -> 5, leaving a spatial\n        # dimension of 5x5 (with 16 channels)\n        self.fc1 = nn.Linear(16 * 5 * 5, 120)\n        self.fc2 = nn.Linear(120, 10)\n\n    def forward(self, x):\n        # Max pooling over a (2, 2) window\n        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))\n        # If the window is square you can specify just a single number\n        x = F.max_pool2d(F.relu(self.conv2(x)), 2)\n        # Flatten everything except the batch dimension (-1 infers the batch size)\n        x = x.view(-1, self.num_flat_features(x))\n        x = F.relu(self.fc1(x))\n        x = self.fc2(x)\n        # Specifying the likelihood function\n        self.classification = distributions.Categorical(logits=x)\n        return x\n\n    def num_flat_features(self, x):\n        size = x.size()[1:]  # all dimensions except the batch dimension\n        num_features = 1\n        for s in size:\n            num_features *= s\n        return num_features\n\n\nnet = Net()\nprint(net)"
      ]
    },
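    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough check of the ``16 * 5 * 5`` input size used for ``fc1`` above, the sketch below traces the tensor shape through the two convolution and pooling stages. It uses plain ``torch.nn`` layers rather than ``borch.nn``, on the assumption that the ``borch.nn`` convolutions produce the same output shapes; the names ``conv1``, ``conv2`` and ``h`` are local to this illustration.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torch.nn.functional as F\n\n# Plain torch layers with the same channel and kernel sizes as in Net\n# (assumption: borch.nn.Conv2d yields the same output shapes)\nconv1 = torch.nn.Conv2d(1, 6, 5)\nconv2 = torch.nn.Conv2d(6, 16, 5)\n\nh = torch.randn(1, 1, 32, 32)  # one 32x32 single-channel image\nh = F.max_pool2d(F.relu(conv1(h)), 2)  # conv: 32 -> 28, pool: 28 -> 14\nprint(h.shape)\nh = F.max_pool2d(F.relu(conv2(h)), 2)  # conv: 14 -> 10, pool: 10 -> 5\nprint(h.shape)\nprint(h.flatten(1).shape)  # 16 * 5 * 5 = 400 features per sample"
      ]
    },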
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You just have to define the ``forward`` function; the ``backward``\nfunction (where gradients are computed) is automatically defined for you\nusing ``autograd``.\nYou can use any of the Tensor operations in the ``forward`` function.\n\nThe learnable parameters of a model are returned by ``net.parameters()``:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "params = list(net.parameters())\nprint(len(params))\nprint(params[0].size())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's try a random 32x32 input:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "input = torch.randn(1, 1, 32, 32)\nout = net(input)\nprint(out)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Zero the gradient buffers of all parameters and backpropagate with random\ngradients:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "net.zero_grad()\nout.backward(torch.randn(1, 10))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>``borch.nn`` only supports mini-batches. The entire ``borch.nn``\n    package only supports inputs that are a mini-batch of samples, and not\n    a single sample.\n\n    For example, ``nn.Conv2d`` will take in a 4D Tensor of\n    ``nSamples x nChannels x Height x Width``.\n\n    If you have a single sample, just use ``input.unsqueeze(0)`` to add\n    a fake batch dimension.</p></div>\n\nBefore proceeding further, let's recap all the classes you\u2019ve seen so far.\n\n**Recap:**\n  -  ``torch.Tensor`` - A *multi-dimensional array* with support for autograd\n     operations like ``backward()``. Also *holds the gradient* w.r.t. the\n     tensor.\n  -  ``nn.Module`` - Neural network module. *Convenient way of\n     encapsulating parameters*, with helpers for moving them to GPU,\n     exporting, loading, etc.\n  -  ``nn.Parameter`` - A kind of Tensor that is *automatically\n     registered as a parameter when assigned as an attribute to a*\n     ``Module``.\n  -  ``autograd.Function`` - Implements *forward and backward definitions\n     of an autograd operation*. Every ``Tensor`` operation creates at\n     least a single ``Function`` node that connects to the functions that\n     created a ``Tensor`` and *encodes its history*.\n\n**At this point, we have covered:**\n  -  Defining a neural network\n  -  Processing inputs and calling backward\n\n**Still left:**\n  -  Computing the loss\n  -  Updating the weights of the network\n\n## Loss Function\nA loss function takes the (output, target) pair of inputs, and computes a\nvalue that estimates how far away the output is from the target.\n\nThere are several different\n[loss functions](http://pytorch.org/docs/nn.html#loss-functions) under the\nnn package.\nA simple loss is ``nn.MSELoss``, which computes the mean-squared error\nbetween the input and the target.\nThese losses are, however, only equivalent to a maximum\nlikelihood approach in deep learning.\n\n\nIn order to infer the posterior of the weights, and thus capture the uncertainty\nof the weights as well, we have to use the ``infer`` package. In this example we\nwill use the ``infer.vi_loss`` function, which automatically creates a loss function\nfor variational inference given the latent variables in your model.\n\nSimilar to how it's done for random variables, we can also observe on the\nmodule using keyword arguments matching the names of the random variables we\nwant to observe. This will add those random variables to the likelihood term\nand we will not infer the distribution over them.\nFor example:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "target = torch.randint(10, (1,))  # a dummy target, for example\nnet.observe(classification=target)\nborch.sample(net)\noutput = net(input)\nloss = infer.vi_loss(**borch.pq_to_infer(net))\nprint(loss)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, if you follow ``loss`` in the backward direction, you will see a graph of\ncomputations that looks like this:\n\n    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d\n          -> view -> linear -> relu -> linear\n          -> loss\n\nSo, when we call ``loss.backward()``, the loss is differentiated w.r.t. every\nTensor in the graph, and all Tensors in the graph that have ``requires_grad=True``\nwill have their ``.grad`` Tensor accumulated with the gradient.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Backprop\nTo backpropagate the error, all we have to do is call ``loss.backward()``.\nYou need to clear the existing gradients first, though, otherwise the new gradients\nwill be accumulated into the existing ones.\n\n\nNow we shall call ``loss.backward()`` and have a look at the gradients of\nconv1's bias before and after the backward pass.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "net.zero_grad()  # zeroes the gradient buffers of all parameters"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The gradient of the `loc` parameter of the approximating distribution of\n``conv1.bias`` after zeroing the gradients is:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(net.conv1.posterior.bias.loc.grad)\n\nloss.backward()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "After calling ``backward``, the gradient is:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(net.conv1.posterior.bias.loc.grad)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**The only thing left to learn is:**\n\n  - Updating the weights of the network\n\n## Update the weights\nThe simplest update rule used in practice is stochastic gradient\ndescent (SGD):\n\n    weight = weight - learning_rate * gradient\n\nWe can implement this using simple Python code:\n\n    learning_rate = 0.01\n    for f in net.parameters():\n        f.data.sub_(f.grad.data * learning_rate)\n\nHowever, as you use neural networks, you will want to use various different\nupdate rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.\nTo enable this, `torch` provides a small package, ``torch.optim``, that\nimplements all these methods. Using it is very simple:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch.optim as optim\n\n# create your optimizer\noptimizer = optim.SGD(net.parameters(), lr=0.01)\n\n# in your training loop:\nn_batch_epoch = 10  # number of batches per epoch, usually len(dataloader)\noptimizer.zero_grad()  # zero the gradient buffers\nborch.sample(net)\noutput = net(input)\nloss = infer.vi_loss(**borch.pq_to_infer(net), kl_scaling=1 / n_batch_epoch)\nloss.backward()\noptimizer.step()  # does the update"
      ]
    },
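    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the connection between the manual update rule and ``torch.optim`` concrete, here is a minimal sketch on a toy tensor, using plain ``torch`` only (no ``borch`` model). The names ``w_manual`` and ``w_optim`` and the quadratic loss are made up for this illustration. With no momentum or weight decay, one ``optim.SGD`` step should match ``weight = weight - learning_rate * gradient``.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\n\ntorch.manual_seed(0)\nlearning_rate = 0.01\n\n# Two copies of the same toy weight vector (hypothetical names)\nw_manual = torch.randn(3, requires_grad=True)\nw_optim = w_manual.detach().clone().requires_grad_(True)\n\n# Manual update: weight = weight - learning_rate * gradient\nloss = (w_manual ** 2).sum()\nloss.backward()\nwith torch.no_grad():\n    w_manual -= learning_rate * w_manual.grad\n\n# The same step taken by torch.optim.SGD (no momentum, no weight decay)\noptimizer = torch.optim.SGD([w_optim], lr=learning_rate)\noptimizer.zero_grad()\nloss = (w_optim ** 2).sum()\nloss.backward()\noptimizer.step()\n\n# Compare the two updated weight vectors\nprint(torch.allclose(w_manual, w_optim))"
      ]
    },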
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Exercises\n1) The neural network package contains various modules and loss functions\n   that form the building blocks of deep neural networks. Have a look at the\n   documentation to see what is available.\n\n2) Try designing your own feed-forward networks with two different types of\n   non-linearities, e.g. ReLU.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}