Refs:

- Yann LeCun (1988). A Theoretical Framework for Back-Propagation.
- https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule
- http://conal.net/papers/beautiful-differentiation/

---

## Autodiff

Autodiff = a set of techniques to numerically evaluate the derivative of a function specified by a program

---

## What if f(x) is not differentiable?

- Sub-gradients
- REINFORCE (Williams, 1992)
- Reparametrization trick
  - Example: Variational Auto-Encoder

---

## The magic of deep learning

- Optimizing an almost *arbitrary* function ??!!
- Classical machine learning: convexity, number of parameters, convergence bounds...
- Why does it work?
  - Generalized regularization
  - SGD

---

# Loss and regularization

---

## Cross-entropy

Entropy of a probability distribution = number of bits required to encode the distribution

$$H(p)=-\sum_i p_i \log p_i$$

Cross-entropy between 2 distributions = number of bits required to encode one distribution when we only know the other

$$H(p,q)=-\sum_i p_i \log q_i$$

Kullback-Leibler divergence = cross-entropy minus entropy

$$KL(p||q)= \sum_i p_i \log \frac {p_i}{q_i}$$

---

## Loss regularization

- A naive DNN overfits $\rightarrow$ we want to reduce variance
- This can be done by adding a term to the loss that limits the space where the parameters can live
- L2 regularization is the most common:

$$loss = E[H(p,q)] + \lambda \sum_i \theta_i^2$$

- L2 can be interpreted as adding a Gaussian prior on the weights

---

## Other regularizations

- L1

$$loss = E[H(p,q)] + \lambda \sum_i |\theta_i|$$

---

## Other regularizations

- Dropout
  - Randomly remove $\lambda$% of the input and hidden neurons

---

## Other regularizations

- DropConnect

---

## Other regularizations

- Artificial data expansion: replicate the dataset by adding $\lambda$% noise
- SGD: tune the mini-batch size

---

## Putting it all together

Each hidden layer of a feedforward classifier is composed of:

- Linear parameters: weight matrix + bias vector
- An activation function (ReLU)
- An optional dropout "layer"

The output layer is composed of:

- Linear parameters: weight matrix + bias vector
- A softmax activation

The training loss = cross-entropy + L2

---

## Computation graph for MLP

- 1-hidden-layer MLP: 2 parameter matrices
- Regularized negative log-likelihood loss:

$$ J = J\_{MLE} + \lambda \left( \sum\_{i,j} (W\_{i,j})^2 + \sum\_{i,j}(W'\_{i,j})^2 \right)$$

---

## Computation graph for MLP

---

## Weights initialization

Why is initialization important?

- If the weights start too small, the signal shrinks as it passes through the layers until it is too tiny to be useful
- If the weights start too large, the signal grows as it passes through the layers until it is too large to be useful

So we want the variance of the inputs and outputs to be similar! (a small sketch of this effect follows on the next slide)
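---

## Weights initialization

A minimal sketch of this effect, assuming PyTorch. The depth, width, and gain values below are arbitrary illustration choices (not from the course), and activations are omitted to keep the variance argument clean:

```python
import torch

def signal_std_after(gain, depth=50, width=256):
    """Push a unit-variance signal through `depth` random linear maps whose
    weights have Var(W) = gain**2 / width, and return the output std."""
    x = torch.randn(1024, width)                       # batch of unit-variance inputs
    for _ in range(depth):
        W = gain * torch.randn(width, width) / width ** 0.5
        x = x @ W                                      # plain linear layer, no activation
    return x.std().item()

print("gain 0.5 ->", signal_std_after(0.5))  # signal shrinks towards 0
print("gain 2.0 ->", signal_std_after(2.0))  # signal blows up
print("gain 1.0 ->", signal_std_after(1.0))  # Var(W) = 1/n_in: variance stays ~1
```

The `gain 1.0` case corresponds to $Var(W) = 1/n_{in}$, which is exactly the Xavier rule of the next slide.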
---

## Xavier initialization

For each neuron with $n_{in}$ inputs:

- Initialize its weights $W$ from a zero-mean Gaussian with
$$Var(W)=\frac 1 {n_{in}}$$

Refs: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

---

## Glorot initialization

For each neuron with $n\_{in}$ inputs, going into $n\_{out}$ neurons:

- Initialize its weights $W$ from a zero-mean Gaussian with
$$Var(W)=\frac 2 {n\_{in} + n\_{out}}$$

---

## Batch normalization

- You *must* normalize your inputs
- Transforms the activations of the previous layer at each batch so that mean = 0 and standard deviation = 1
- May have parameters to further scale + shift
- https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
- http://pytorch.org/docs/master/nn.html?highlight=batchnorm1d#torch.nn.BatchNorm1d

---

## A lot of tricks

MLP, simple?

- Find a good topology, residual connections, parameter sharing...
- Find a good SGD variant
- Tune the learning rate, the number of neurons...
- Tune the regularization strength, dropout...
- Tune batch normalization
- Tune the number of epochs, the batch size...

Andrew Ng: "If you tune only one thing, tune the learning rate!"

---

## Exercise: computation graph

Let us consider the function

$$f(x,y,z) = x \cdot (y^3) + \left( | 2+xz | \right)^2$$

- Draw its circuit diagram
- Compute its forward values at the point $x=1, y=2, z=3$. Write these values on top of the arrows.
- Compute its backward gradients. Write these gradients below the arrows.
- Write the PyTorch program that computes this function and its gradient with PyTorch automatic differentiation facilities and prints the value of $\frac{\partial f}{\partial x}$

---

## Exercises: boolean logic

### Logical XOR

- Implement a feedforward network with 1 layer.
- Observations = 00:0 01:1 10:1 11:0
- Initial weights = 0
- What happens?
- Same thing with 2 layers

---

## Exercises: input normalization

### Unnormalized AND

- Observations = 00:0 02:0 20:0 22:1
- Initial weights = 0
- How many epochs are needed for convergence?
- What are the weights of the trained model?
- Compare with Observations = 00:0 01:0 10:0 11:1
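---

## Exercises: a starting skeleton

A possible starting skeleton for the XOR / AND exercises above, assuming PyTorch. It is a sketch, not the expected solution: the observations, learning rate, and number of epochs are placeholders to adapt to each exercise.

```python
import torch
import torch.nn as nn

# XOR observations: 00:0 01:1 10:1 11:0 (replace with the AND variants as needed)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# 1-layer feedforward network (add a hidden layer for the 2-layer variant)
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
for p in model.parameters():
    nn.init.zeros_(p)                     # initial weights = 0, as in the exercise

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for epoch in range(1000):                 # number of epochs: placeholder
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print("loss:", loss.item())
print("predictions:", model(X).detach().squeeze().tolist())
print("weights:", [p.detach().tolist() for p in model.parameters()])
```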