Refs:

- Yann LeCun (1988). A Theoretical Framework for Back-Propagation.
- https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule
- http://conal.net/papers/beautiful-differentiation/

---

## Autodiff

Autodiff = a set of techniques to numerically evaluate the derivative of a function specified by a program

---

## What if f(x) is not differentiable?

- Sub-gradients
- REINFORCE (Williams, 1992)
- Reparametrization trick
  - Example: Variational Auto-Encoder

---

## The magic of deep learning

- Optimizing an almost *arbitrary* function ??!!
- Classical machine learning: convexity, number of parameters, convergence bounds...
- Why does it work?
  - Generalized regularization
  - SGD

---

# Loss and regularization

---

## Cross-entropy

Entropy of a probability distribution = number of bits required to encode the distribution

$$H(p)=-\sum_i p_i \log p_i$$

Cross-entropy between 2 distributions = number of bits required to encode one distribution when we only know the other

$$H(p,q)=-\sum_i p_i \log q_i$$

Kullback-Leibler divergence = cross-entropy minus entropy

$$KL(p||q)= \sum_i p_i \log \frac {p_i}{q_i}$$

---

## Loss regularization

- A naive DNN overfits $\rightarrow$ we want to reduce variance
- This can be done by adding a term to the loss that limits the space where the parameters can live
- L2 regularization is the most common:

$$loss = E[H(p,q)] + \lambda \sum_i \theta_i^2$$

- L2 can be interpreted as adding a Gaussian prior on the weights

---

## Other regularizations

- L1

$$loss = E[H(p,q)] + \lambda \sum_i |\theta_i|$$

---

## Other regularizations

- Dropout
  - Randomly remove $\lambda$% of the input and hidden neurons

---

## Other regularizations

- DropConnect

---

## Other regularizations

- Artificial data expansion: replicate the dataset by adding $\lambda$% noise
- SGD: tune the mini-batch size

---

## Putting it all together

Each hidden layer of a feedforward classifier is composed of:

- Linear parameters: weight matrix + bias vector
- An activation function (ReLU)
- An optional dropout "layer"

The output layer is composed of:

- Linear parameters: weight matrix + bias vector
- A softmax activation

The training loss = cross-entropy + L2

---

## Computation graph for MLP

- 1-hidden-layer MLP: 2 parameter matrices
- Regularized negative log-likelihood loss:

$$ J = J\_{MLE} + \lambda \left( \sum\_{i,j} (W\_{i,j})^2 + \sum\_{i,j}(W'\_{i,j})^2 \right)$$

---

## Computation graph for MLP

---

## Weights initialization

Why is initialization important?

- If the weights start too small, the signal shrinks as it passes through the layers until it is too tiny to be useful
- If the weights start too large, the signal grows as it passes through the layers until it is too large to be useful

So we want the variance of the inputs and outputs to be similar! (a small sketch of this effect follows on the next slide)
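---

## Weights initialization

A minimal sketch of this effect, assuming PyTorch. The depth, width, and gain values below are arbitrary illustration choices (not from the course), and activations are omitted to keep the variance argument clean:

```python
import torch

def signal_std_after(gain, depth=50, width=256):
    """Push a unit-variance signal through `depth` random linear maps whose
    weights have Var(W) = gain**2 / width, and return the output std."""
    x = torch.randn(1024, width)                       # batch of unit-variance inputs
    for _ in range(depth):
        W = gain * torch.randn(width, width) / width ** 0.5
        x = x @ W                                      # plain linear layer, no activation
    return x.std().item()

print("gain 0.5 ->", signal_std_after(0.5))  # signal shrinks towards 0
print("gain 2.0 ->", signal_std_after(2.0))  # signal blows up
print("gain 1.0 ->", signal_std_after(1.0))  # Var(W) = 1/n_in: variance stays ~1
```

The `gain 1.0` case corresponds to $Var(W) = 1/n_{in}$, which is exactly the Xavier rule of the next slide.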
---

## Xavier initialization

For each neuron with $n_{in}$ inputs:

- Initialize its weights $W$ from a zero-mean Gaussian with
$$Var(W)=\frac 1 {n_{in}}$$

Refs: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

---

## Glorot initialization

For each neuron with $n\_{in}$ inputs, going into $n\_{out}$ neurons:

- Initialize its weights $W$ from a zero-mean Gaussian with
$$Var(W)=\frac 2 {n\_{in} + n\_{out}}$$

---

## Batch normalization

- You *must* normalize your inputs
- Transforms the activations of the previous layer at each batch so that mean = 0 and standard deviation = 1
- May have parameters to further scale + shift
- https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
- http://pytorch.org/docs/master/nn.html?highlight=batchnorm1d#torch.nn.BatchNorm1d

---

## A lot of tricks

MLP, simple?

- Find a good topology, residual connections, parameter sharing...
- Find a good SGD variant
- Tune the learning rate, the number of neurons...
- Tune the regularization strength, dropout...
- Tune batch normalization
- Tune the number of epochs, the batch size...

Andrew Ng: "If you tune only one thing, tune the learning rate!"

---

## Exercise: computation graph

Let us consider the function

$$f(x,y,z) = x \cdot (y^3) + \left( | 2+xz | \right)^2$$

- Draw its circuit diagram
- Compute its forward values at the point $x=1, y=2, z=3$. Write these values on top of the arrows.
- Compute its backward gradients. Write these gradients below the arrows.
- Write the PyTorch program that computes this function and its gradient with PyTorch automatic differentiation facilities and prints the value of $\frac{\partial f}{\partial x}$

---

## Exercises: boolean logic

### Logical XOR

- Implement a feedforward network with 1 layer.
- Observations = 00:0 01:1 10:1 11:0
- Initial weights = 0
- What happens?
- Same thing with 2 layers

---

## Exercises: input normalization

### Unnormalized AND

- Observations = 00:0 02:0 20:0 22:1
- Initial weights = 0
- How many epochs are needed for convergence?
- What are the weights of the trained model?
- Compare with Observations = 00:0 01:0 10:0 11:1
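---

## Exercises: a starting skeleton

A possible starting skeleton for the XOR / AND exercises above, assuming PyTorch. It is a sketch, not the expected solution: the observations, learning rate, and number of epochs are placeholders to adapt to each exercise.

```python
import torch
import torch.nn as nn

# XOR observations: 00:0 01:1 10:1 11:0 (replace with the AND variants as needed)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# 1-layer feedforward network (add a hidden layer for the 2-layer variant)
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
for p in model.parameters():
    nn.init.zeros_(p)                     # initial weights = 0, as in the exercise

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for epoch in range(1000):                 # number of epochs: placeholder
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print("loss:", loss.item())
print("predictions:", model(X).detach().squeeze().tolist())
print("weights:", [p.detach().tolist() for p in model.parameters()])
```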